Where and how can we add value to users?
I think “log” should be a generalized term, including at least:
Together, these represent the “status” of the target systems so that users can understand whether their systems are healthy or not, and if not, where the problem is and how to resolve it.
From the system’s perspective, there’s really no difference among these. They are all just “text” data that needs to be parsed/stored/indexed/aggregated and then fetched/searched/charted/analyzed.
In general, there are two different approaches:
Centrally managed collectors are usually proprietary, like SumoLogic’s (and MII’s as well). SumoLogic’s central collector (source) management is able to change remote collectors’ config: target file location(s), filters, tags, etc. The upside is, well, centralized management: you manage all collectors in one place (on the UI). Don’t get it wrong, you still need to install a collector on each host; it’s just that the configuration is done on the server.
The 2nd approach configures each collector separately on the host where it is installed. These vendors usually don’t implement their own collector but instead rely on commonly used collectors like rsyslog/Flume/Fluentd/Logstash. All they need to do is support one common endpoint (syslog is the most common protocol) or provide their own plugin for those collectors.
Flume/Fluentd/Logstash all have more features (a strong plugin ecosystem) than SumoLogic’s collector, though the convenience of managing sets of collectors centrally is a real advantage. I’d argue that the ideal design is to be able to use those common collectors while still being able to manage them centrally (to some extent). I don’t think these two approaches are mutually exclusive.
There are generally 2 types of log query implementation: Lucene-based, and Splunk/SumoLogic-style extensions.
The difference is that a Lucene-based query only has, well, Lucene search functions. That basically means you can only search on fields present in the Lucene index, and generally there’s no aggregation or analytics support in the query itself. This category includes Solr, ElasticSearch, Loggly (among others) … You do get limited aggregation and also faceting, which all 3 products mentioned above support, but not directly through the query. Mostly aggregation and faceting are UI operations: you search for something first and then aggregate on the result set, or use faceting on the side, which indirectly suggests possible filters to apply next. This category generally requires UI operations in order to do meaningful aggregation or analytics. There’s a “multi-step” feel to completing a task.
Splunk and Sumo Logic are in the second category. The “query” language supports aggregation and much more, directly. You can still use the UI to do things, but in the end they get translated into a “query”. There’s also “pipe”, so you can apply one operation after another, which is very convenient and powerful. Another strong point is that you can create new, “virtual” fields on the fly. For example, the browser agent field in Apache access logs contains multiple pieces of information. With this you can “extract” just the OS part without actually having an OS field in the index (which would need a schema change and a re-index from Lucene’s perspective).
From the indexing perspective, the 1st feels more static while the 2nd feels more dynamic. With the 1st, you have to re-process and re-index if you need a new field. With the 2nd, you just add or change the query. But of course, the 2nd suffers some runtime overhead, which is the trade-off. One could also argue that the 2nd requires more advanced knowledge of the query language.
I do not believe these are mutually exclusive. It would be great to combine the good of both. In fact, Sumo Logic does mention this and plans to add finer control over specifying static fields.
All products I’ve tried use saved searches for creating alerts. When creating an alert, you specify the criteria for matching records with a query saved as a search, as well as how often it should be checked. The highest frequency for running alert searches is once per minute (across all products I’ve tried), probably out of performance considerations. Usually there’s an additional control on how many matches within a specific time window should be treated as an alert. For example, only more than 5 matches (of the alert’s saved search query) in the past 60 minutes would trigger an alert.
SumoLogic does support real-time alerts based on saved searches (“human real time” as it is explained, in the range of 8-12 seconds), though in real-time mode it can only check up to the latest 15-minute time window. Also, some search language operators are not allowed in real-time mode.
Using saved searches for alerts provides the most flexibility, as whatever you can do with search is available to alerting. From the implementation perspective, you don’t need to create separate logic for alert evaluation, which is nice. However, performance becomes a problem if the alert checks run too often.
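To make that concrete, here is a minimal sketch of what a saved-search-based alert evaluator could look like on our side. Everything named here (the SearchService interface, the threshold/window fields, the one-minute schedule) is an assumption for illustration, not any vendor’s actual API.

import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical search backend: runs a saved-search query over a time window
// and returns the number of matching records.
interface SearchService {
    long countMatches(String query, Instant from, Instant to);
}

// One user-defined alert: a saved search plus a threshold over a window.
class SavedSearchAlert {
    final String query;          // e.g. "status_code:5* AND host:web*"
    final int threshold;         // e.g. fire when more than 5 matches
    final Duration window;       // e.g. the past 60 minutes

    SavedSearchAlert(String query, int threshold, Duration window) {
        this.query = query;
        this.threshold = threshold;
        this.window = window;
    }
}

public class AlertScheduler {
    private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);

    // Re-evaluate the saved search every minute -- the common lower bound
    // observed across the products above.
    public void schedule(SearchService search, SavedSearchAlert alert, Runnable notify) {
        scheduler.scheduleAtFixedRate(() -> {
            Instant now = Instant.now();
            long matches = search.countMatches(alert.query, now.minus(alert.window), now);
            if (matches > alert.threshold) {
                notify.run();   // e.g. send email / webhook
            }
        }, 0, 1, TimeUnit.MINUTES);
    }
}

The cost model is visible right in the sketch: every alert is one full search per minute, which is exactly the performance concern above.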
One question that needs to be answered is:
How do we dynamically add fields to the index, assuming we are going to use a Lucene-based solution (Solr or QRadar)?
Dynamically changing and reloading the schema definition does seem feasible (without a restart, of course). This needs further investigation depending on what we use.
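If we end up on Solr, its Schema API (with a managed schema) is the obvious candidate for adding a field at runtime. A minimal sketch, assuming a collection named logs on a local Solr; the field name and type are illustrative:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class AddFieldExample {
    public static void main(String[] args) throws Exception {
        // Add a new "os" field to the managed schema of the (assumed) "logs" collection.
        String body = "{ \"add-field\": { \"name\": \"os\", \"type\": \"string\", \"stored\": true } }";

        URL url = new URL("http://localhost:8983/solr/logs/schema");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("Schema API response: " + conn.getResponseCode());
        // Note: documents ingested before this change still need to be re-indexed
        // for the new field to be populated -- hence the Kafka replay below.
    }
}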
The other part, re-indexing, is easily done with Kafka. The ultimate source of truth should be on HDFS, which means we’ll first need to “normalize” input data from Kafka and persist it into HDFS, then pour it back into Kafka under another topic. When replaying, we simply read from HDFS and re-pour the data back into Kafka. This could be done in Storm, with separate bolts for Kafka and HDFS persistence.
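A rough sketch of what that topology could look like, using the storm-kafka spout. The bolt classes and topic names here are assumptions; the persist/republish bolts are stubs standing in for real HDFS and Kafka writers.

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SchemeAsMultiScheme;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;

public class ReindexTopology {

    // Normalize the raw log line into the canonical, immutable form.
    static class NormalizeBolt extends BaseBasicBolt {
        public void execute(Tuple input, BasicOutputCollector collector) {
            String raw = input.getString(0);
            collector.emit(new Values(raw.trim()));   // real parsing/normalization goes here
        }
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("normalized"));
        }
    }

    // Stub: the real bolt would append the normalized record to HDFS (source of truth).
    static class HdfsPersistBolt extends BaseBasicBolt {
        public void execute(Tuple input, BasicOutputCollector collector) { /* write to HDFS */ }
        public void declareOutputFields(OutputFieldsDeclarer declarer) { }
    }

    // Stub: the real bolt would publish the normalized record to another Kafka topic.
    static class RepublishBolt extends BaseBasicBolt {
        public void execute(Tuple input, BasicOutputCollector collector) { /* send to Kafka */ }
        public void declareOutputFields(OutputFieldsDeclarer declarer) { }
    }

    public static void main(String[] args) {
        // Read raw events from the (assumed) "raw-logs" topic via the storm-kafka spout.
        SpoutConfig spoutConfig = new SpoutConfig(
                new ZkHosts("zk1:2181"), "raw-logs", "/kafka", "reindex-spout");
        spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 2);
        builder.setBolt("normalize", new NormalizeBolt(), 4).shuffleGrouping("kafka-spout");
        builder.setBolt("hdfs-persist", new HdfsPersistBolt(), 2).shuffleGrouping("normalize");
        builder.setBolt("republish", new RepublishBolt(), 2).shuffleGrouping("normalize");

        new LocalCluster().submitTopology("reindex", new Config(), builder.createTopology());
    }
}

Replaying is then the same topology with an HDFS-reading spout in place of the Kafka spout.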
Obviously, we search for a problem or the root cause of a problem. But upon a “hit”, there should be a way to pull in all related info. One possible dimension is time; another might be the call trace.
What is the best way of receiving data into Kafka over the internet?
Analytics on the simple access log from Apache. Of course, these scenarios are meant to be put on a dashboard.
Unique visitors require the ability to “combine” fields. OS or browser requires a virtual field derived from an existing field.
Data goes to both batch and speed layer. Batch layer contains the master dataset, serving layer generates query views, and speed layer provides query views of the most recent data which batch/serving layer has not yet caught up.
The batch layer runs functions over the master dataset to precompute intermediate data for your queries, which are stored in batch views in the serving layer. The speed layer compensates for the high latency of the batch layer by providing low latency updates using data that has yet to be precomputed into a batch view. Queries are then satisfied by processing data from the serving layer views and the speed layer views and merging the results.
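As a toy illustration of that merge step (the BatchView/RealtimeView interfaces are hypothetical read APIs over the serving and speed layers, not part of any particular framework):

import java.util.HashMap;
import java.util.Map;

// Hypothetical read interfaces over the serving layer and the speed layer.
interface BatchView    { Map<String, Long> countsUpTo(long hourInclusive); }
interface RealtimeView { Map<String, Long> countsAfter(long hourExclusive); }

public class MergedQuery {
    // Merge batch (complete but late) and realtime (recent but partial) counts.
    public static Map<String, Long> countsByHost(BatchView batch, RealtimeView speed,
                                                 long lastBatchHour) {
        Map<String, Long> merged = new HashMap<>(batch.countsUpTo(lastBatchHour));
        speed.countsAfter(lastBatchHour)
             .forEach((host, count) -> merged.merge(host, count, Long::sum));
        return merged;
    }
}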
A key/value store providing random and realtime read/write access.
The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
It does not provide random access (read/write).
Think of it as a higher-level abstraction for using MapReduce with a SQL-like method.
Hive is useful only if the data is structured.
* | transaction host cookie
event=1 host=a
event=2 host=a cookie=b
event=3 cookie=b
Group (cross-source) events together (transitive or not) for further aggregation.
Maximo logs:
SessionId -> ActionId/time -> SqlId/time
n lowest actions taking more than t sec
actions taking more than p% of time
if (<cond>, <true>, <false>) as <field>
User expectation
There are 3 main goals:
I want to dig into 2 and 3 first. They are more likely the areas where a system can differentiate itself from others.
Conceptually, there are 2 steps in using such a system to resolve problems: 1) identify the problems, 2) resolve the problems.
Of course, users can manually search for problems (based on their experience). But a superior system would “magically” surface potential problems to users via alerts. That’s the selling point; we’ll talk about it later. Let’s assume a potential problem has been identified, and we have the exact timestamp of when it happened and what it is.
Now back to anomaly detection. In addition to user-defined alerts, there is also automatic detection:
Then there are situations where users would like to search manually to identify problems. We could do something to assist in this process:
Arguments against Custom Collector and Central Management
Even though central management has its upsides, there are some concerns:
I think these are more likely concerns for large shops.
BTW, we also need to consider receiving metrics via the protocols used by the most common time-series metrics systems, e.g. Graphite/StatsD. We want more than logs.
This is not an elegant design; it is too “hacky”. There is too much “state” to maintain, and it risks becoming inconsistent between the client and server side. It might be worthwhile to just enhance an open source collector to allow server-controlled configuration. Users can simply add our plugin (a trivial thing for Logstash and Fluentd). The critical piece is to see whether the existing plugin framework allows configuration reloading.
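A rough sketch of the kind of server-controlled reload such a plugin would need, assuming a hypothetical central config endpoint and a collector that exposes some reload hook; whether the existing plugin frameworks actually allow this is exactly the open question:

import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ConfigPoller {
    private volatile String currentConfig = "";

    // Poll the (hypothetical) central management endpoint and trigger a reload
    // only when the served configuration actually changes.
    public void start(String configUrl, Runnable reloadCollector) {
        Executors.newSingleThreadScheduledExecutor().scheduleWithFixedDelay(() -> {
            try (InputStream in = new URL(configUrl).openStream();
                 Scanner s = new Scanner(in, StandardCharsets.UTF_8.name()).useDelimiter("\\A")) {
                String fetched = s.hasNext() ? s.next() : "";
                if (!fetched.equals(currentConfig)) {
                    currentConfig = fetched;
                    reloadCollector.run();   // e.g. rewrite the fluentd/logstash config and signal a reload
                }
            } catch (Exception e) {
                // keep running with the last known config on network errors
            }
        }, 0, 2, TimeUnit.MINUTES);
    }
}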
Limitation of Lucene based query language
Simply looking for something is not a problem. The problem is correlating different records in the result set (that is, aggregation and analytics).
Take the experiment scenarios on the Apache access log. Loggly (mainly Lucene search only) can only “partially” achieve 6 out of the 9 scenarios, mostly because it can only aggregate and sort on one dimension. All 6 scenarios achievable by Loggly are of the form “top xxx sorted by yyy…”. That translates to searching by “xxx” then ordering by “yyy” (order-by is through the UI, not the search language). Even for those 6, the result cannot include fields other than “xxx” (so it’s only for charting purposes, like a pie chart). The needed UI operation is not yet available on Loggly.
The other 3 scenarios all use the so-called “unique visitors”, which is defined by the combination of 3 fields: date, ip, and agent. Here, Lucene/Solr does not support virtual or derived fields, so unless we change the schema to add 3 additional fields and re-index, we cannot search on “unique visitors”.
Similarly, 2 of the 3 scenarios also use “partial” field info from the “agent” field, which contains both OS and browser info. If we want to aggregate on OS, we cannot, unless we add a new field containing only the OS part of the original “agent” field.
I think since this is not provided by the Lucene search engine, it should be the upper layer’s responsibility, either Solr’s or the application’s. The Splunk/SumoLogic query language is one way to support that (on top of the search result set), while it is quite possible to achieve the same thing via UI operations. It is just that Loggly has not yet reached that level.
I would say the query language approach is much more flexible and elegant from an implementation perspective. Combined with the “pipe”/stream concept, it is very powerful.
QRadar
From my understanding, QRadar does it the reverse way. Instead of running alerts periodically, it runs all alert checks on each event received. Since events must go through the CRE before being persisted into Ariel, some latency must be added to the ingestion process.
On the other hand, running saved searches does not impact ingestion. The only contention is whether the query (read) impacts writes to the underlying data store, and by how much.
If using search-based alerts, we can’t afford to do it QRadar’s way, that is, running all alert queries on every single event received (essentially, QRadar is not using search to implement alerts).
However, QRadar’s alerting does allow a few interesting and powerful concepts, which cannot be covered by search based approach (at least not directly):
Search based is the way to go:
In addition to user-defined alerts, there is another category: anomaly detection alerts. Users should be able to turn these alerts on/off, and possibly even tweak the algorithms used for their needs, but this category of alerts should be turned on by default.
The first component, and the entry point, of our pipeline should be Kafka, for a few reasons:
Note that Kafka is not actually the entry point of the whole system, as it currently does not have security. There is another light layer in front of Kafka responsible for receiving from collectors and publishing to Kafka topics. This light layer could simply be Logstash or Fluentd behind DNS and/or a load balancer.
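If we end up writing that light layer ourselves instead of reusing Logstash/Fluentd, its core is just a producer that publishes received payloads to a topic. A minimal sketch with the standard Kafka producer client; the broker list and the raw-logs topic name are assumptions:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class IngestGateway {
    private final KafkaProducer<String, String> producer;

    public IngestGateway(String brokers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", brokers);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producer = new KafkaProducer<>(props);
    }

    // Called by the receiving endpoint (HTTP/syslog) after authenticating the collector.
    // Keying by source host keeps each host's events ordered within a partition.
    public void publish(String sourceHost, String rawEvent) {
        producer.send(new ProducerRecord<>("raw-logs", sourceHost, rawEvent));
    }
}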
This is where the source of truth is stored. It is immutable and normalized. Each atomic piece of data should:
Enforceable and Evolvable Schema
Data nodes
EP could have additional data nodes which shard the data as well as spread the search workload. Adding data nodes is dynamic.
ECS
Adding an ECS, on the other hand, is not really dynamic (in the distribution sense). Data stays where it was processed, which means users need to manually distribute the sending among multiple ECSs.
On the UI, it is a 2nd search mode: “Quick Filter”.
Use Lucene library for free-form search capabilities inside raw Ariel payload data
Solr adds functionality like
No automatic resharding or dynamic addition of replicas
Fields cannot be updated/added easily/efficiently
The most basic form of faceting shows only a breakdown of unique values within a field shared across many documents (a list of categories, for example), but Solr also provides many advanced faceting capabilities. These include faceting based upon
Solr also allows for hierarchical and multidimensional faceting and multiselect faceting, which allows returning facet counts for documents that may have already been filtered out of a search result.
I found that having a single source of stream data is great in that we could have many consumers: APM, Alerting, QRadar, Analytics …
But then, each would need to be a stream processing system. And how do we keep “history” available for processing purposes?
If we keep the canonical/atomic data in Kafka as the source stream, each consumer would then be equal in that they all process on their own. On any failure, they just replay from Kafka. There’s no need for other systems to read from Hadoop, for example.
Plus, any derived stream would also be available if needed by some systems.
We still need a kind of snapshot/restore facility so we don’t have to keep the complete log in Kafka.
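The replay story can then be as simple as each consumer rewinding to the earliest retained offset of the canonical topic and rebuilding its own view. A sketch with the standard Kafka consumer client (topic name and group id are assumptions; older client versions have a slightly different seek API):

import java.util.List;
import java.util.Properties;
import java.util.stream.Collectors;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ReplayConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka1:9092");
        props.put("group.id", "analytics-replay");   // each consuming system uses its own group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Rebuild state by rewinding to the earliest retained offset of the canonical topic.
            List<TopicPartition> partitions = consumer.partitionsFor("normalized-logs").stream()
                    .map(p -> new TopicPartition(p.topic(), p.partition()))
                    .collect(Collectors.toList());
            consumer.assign(partitions);
            consumer.seekToBeginning(partitions);

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(1000);
                for (ConsumerRecord<String, String> record : records) {
                    // re-process into this consumer's own view (index, alert state, etc.)
                }
            }
        }
    }
}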
Storm processing output is better written back to Kafka instead of directly to data stores (say, Cassandra). This decouples Storm from the underlying data store.
http://www.slideshare.net/Hadoop_Summit/analyzing-12-million-network-packets-per-second-in-realtime (p26)
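A sketch of what such a decoupling bolt could look like: it only knows how to publish result tuples to an (assumed) output topic, and the actual data store writers consume from there. The field names "key"/"result" are illustrative.

import java.util.Map;
import java.util.Properties;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Terminal bolt: publishes processing results to Kafka rather than writing
// directly to the data store, so the store can be swapped without touching Storm.
public class KafkaPublishBolt extends BaseRichBolt {
    private final String topic;
    private transient KafkaProducer<String, String> producer;
    private OutputCollector collector;

    public KafkaPublishBolt(String topic) { this.topic = topic; }

    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka1:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producer = new KafkaProducer<>(props);
    }

    public void execute(Tuple input) {
        producer.send(new ProducerRecord<>(topic, input.getStringByField("key"),
                                                  input.getStringByField("result")));
        collector.ack(input);
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) { /* terminal bolt */ }
}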
Kafka Spout
How do you implement stateful stream processing in Storm?
http://www.slideshare.net/Hadoop_Summit/analyzing-12-million-network-packets-per-second-in-realtime (p41)
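One simple (and only partially fault-tolerant) approach is to keep state inside the bolt and rely on Kafka replay to rebuild it after a failure; Trident or an external store are the heavier options. A sketch of a windowed counting bolt, with all names illustrative:

import java.util.HashMap;
import java.util.Map;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Keeps per-host counts in memory and flushes them downstream once a minute.
// State lives only in this task; on failure it is rebuilt by replaying from Kafka.
public class WindowedCountBolt extends BaseRichBolt {
    private transient Map<String, Long> counts;
    private transient long windowStart;
    private OutputCollector collector;

    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        this.counts = new HashMap<>();
        this.windowStart = System.currentTimeMillis();
    }

    public void execute(Tuple input) {
        String host = input.getStringByField("host");
        counts.merge(host, 1L, Long::sum);
        collector.ack(input);

        if (System.currentTimeMillis() - windowStart >= 60_000) {
            for (Map.Entry<String, Long> e : counts.entrySet()) {
                collector.emit(new Values(e.getKey(), e.getValue(), windowStart));
            }
            counts.clear();
            windowStart = System.currentTimeMillis();
        }
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("host", "count", "window_start"));
    }
}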
(2011)
(2013/09)
(2014/08)
See What We Learned About Scaling with Apache Storm: Pushing the Performance Envelope
scenarios done: 2/3/4/5 (no table form, just chart), 8/9
Provisioning takes about 1 min.
Built on Angular; page loads are very slow … It could be the network, but it doesn’t feel fast; in fact it feels very slow, like when faceting.
Automatically apply the Summarize operator to the current results (non-aggregate).
The Summarize operator’s algorithm uses fuzzy logic to cluster messages together based on string and pattern similarity. You can use the summarize operator to quickly assess activity patterns for things like a range of devices or traffic on a website. Focus the summarize algorithm on an area of interest by defining that area in the keyword expression.
Reduces logs into patterns, and works across multiple sources. By identifying changes in log patterns, say when a new pattern emerges, it is able to do anomaly detection.
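As a rough illustration of the idea (this is not SumoLogic’s actual algorithm, which does fuzzy clustering), a naive sketch that reduces lines to patterns by masking variable tokens and counting per pattern; a brand-new pattern or a sudden count change is an anomaly candidate:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NaiveSummarizer {
    // Reduce a log line to a crude "pattern" by masking numbers and hex-like tokens.
    static String toPattern(String line) {
        return line.replaceAll("\\b0x[0-9a-fA-F]+\\b", "*")
                   .replaceAll("\\b\\d+\\b", "*");
    }

    // Count occurrences of each pattern; an unseen pattern (or a sharp count change
    // between two windows) is a candidate anomaly.
    static Map<String, Long> summarize(List<String> lines) {
        Map<String, Long> counts = new HashMap<>();
        for (String line : lines) {
            counts.merge(toPattern(line), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "session 4711 opened for user 1001",
            "session 4712 opened for user 1002",
            "disk error at block 0xdeadbeef");
        summarize(lines).forEach((pattern, count) ->
            System.out.println(count + "  " + pattern));
    }
}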
allows cross-event tracing of transactions based on a common field, with states defined:
... | transaction on sessionid
with "Starting session *" as init,
with "Initiating countdown *" thresh=2 as countdown_start,
with "Countdown reached *" thresh=2 as countdown_done,
with "Launch *" as launch
results by transactions showing max(_messagetime),
sum("1") for init as initcount
also able to draw flow chart directly:
_sourceCategory=[sourceCategoryName]
| parse "] * - " as sessionID
| parse "- * accepting request." as module
| transaction on SessionID with states Frontend, QueryProcessor, Dashboard, DB1, DB2, DB3 in module
results by flow
| count by fromstate, tostate
Like a materialized search: pre-aggregated results.
You can create a partition specific to a search sub-range, for performance.
all 9 scenarios can be done.
It is very much like Splunk, which does all the processing at the search phase. This typically causes long durations, especially for aggregation-type operations. Loggly-style searches usually come back in less than 5 sec.
It does support field extraction at the collection phase: define the parse operator as a rule and it is applied automatically. This combines the best of both worlds.
provisioning not fast …
Collector setup is not flexible; it requires installation of its own collector.
now I know why IBM is looking for it. enterprise…
More developer-centric mindset. It doesn’t have a rich feature set; most of its functions are basic “tail”-like operations.
On the analytics side, it does not have any ‘native’ functions. It allows daily archiving to S3 and then provides some scripts/instructions for loading archives into different tools/environments for self-service analytics work, e.g. Hive or Redshift.
One interesting tool allows tailing the centralized logs locally.
A Ruby shop.
Tried local Banana/Solr. To be honest, it’s not fast. Any action on the page takes maybe at least 1 second. (This may be due to local Solr slowness, not sure.)
Also, sometimes the page just “locks”.
What I like about Banana (Kibana I assume):
Searches and Reports
TPAE UI session > 150
sourcetype=tpae eventtype="tpae_total_users_and_jvm_ui_sessions" OR eventtype="tpae_total_users" OR eventtype="tpae_jvm_ui_sessions"|timechart span=15m max(ui_sessions) as sessions by host|untable _time host sessions|search sessions > 150|timechart first(sessions) by host
TPAE UI session tracks
correlation_id=* ElapsedTime session_id=*|table session_id correlation_id _time user_id elapsed_time app event|sort session_id correlation_id
TPAE memory available
sourcetype=tpae eventtype="tpae_memory_status"|stats sparkline(avg(eval(memory_available/1024/1024))) as memory_available, avg(eval(memory_available/1024/1024)) as avg, min(eval(memory_available/1024/1024)) as min, max(eval(memory_available/1024/1024)) as max by host
TPAE slow query by app, object
sourcetype=tpae eventtype="tpae_slow_sql"|stats count, avg(sql_time) as avg_exec_time, max(sql_time) as max_exec_time, min(sql_time) as min_exec_time by app, object|sort 20 - count, - avg_exec_time
TPAE stacked memory available and used
sourcetype=tpae eventtype="tpae_memory_status"|timechart span=15m avg(eval(memory_available/1024/1024)) as avg_available, avg(eval((memory_total - memory_available)/1024/1024)) as avg_used
TPAE top mbo count
sourcetype=tpae eventtype="tpae_mbo"|stats sparkline(avg(mbo_count)), avg(mbo_set_count) as set_count, avg(mbo_count) as count by mbo|sort 20 - set_count
TPAE ui sessions' correlations
correlation_id="*" ElapsedTime|stats first(elapsed_time) as time first(user_id) as user first(app) as app first(session_id) as session_id by correlation_id|stats count(correlation_id) as count first(time) as time first(user) as user first(app) as app by session_id|sort 10 - time
TPAE users and session
sourcetype=tpae eventtype="tpae_total_users_and_jvm_ui_sessions" OR eventtype="tpae_total_users" OR eventtype="tpae_jvm_ui_sessions" | timechart span=15m first(users) as total_users, avg(ui_sessions) as per_jvm_sessions
Master dataset must be atomic, immutable, and perpetual.
Database requirements:
https://www.bro.org/brocon2014/brocon2014_grutzmacher_opensoc.pdf (p23)
Algorithms for batch and streaming processing.
It’s bad for end users to be the first to notice this.
if resp_time > 2.0  # seconds
  alert!
end
That is easy.
This is better.
We should detect this for users.
Seasonal patterns, longer period underlying trend.
Simple rules, advanced rules with states, really complex rules with lots of calculations …
In time, or even better, before the incidents.
We need complete information.
With the identified incident’s time:
multi-level facet across separate fields
Normal with message volume over time:
Drag/drop a facet onto it, and it becomes:
Learning
(from goaccess)
A few scenarios require more than just search with simple aggregation/sort.
(Others are mostly achieved using group-by.)
Daily unique visitors:
* | parse using public/apache/access
| timeslice 1d
| sum(size), count by src_ip, _timeslice, agent
| sum(_sum) as size, count by _timeslice
Time size count
12-26-2014 00:00:00 62,277,020 15,070
12-27-2014 00:00:00 447,010 100
(SumoLogic syntax)
Top operating systems sorted by unique visitors:
* | extract "(?<src_ip>\S+?)
\S+ \S+ \S+ \S+ \"[A-Z]+ \S+
HTTP/[\d\.]+\" \S+ \S+ \S+
\"(?<agent>[^\"]+?)\""
| if (agent matches "*Win*", "Win", "0") as OS
| if (agent matches "*Linux*", "Linux", OS) as OS
| where OS != "0"
| timeslice 1d
| count by OS, src_ip, _timeslice, agent
| top 10 OS
(SumoLogic syntax)
| timechart span=15m
max(ui_sessions) as sessions by host
| untable _time host sessions
| search sessions > 150
| timechart first(sessions) by host
| stats
sparkline(avg(eval(memory_available/1024/1024))) as memory_available,
avg(eval(memory_available/1024/1024)) as avg,
min(eval(memory_available/1024/1024)) as min,
max(eval(memory_available/1024/1024)) as max
by host
| timechart span=15m
avg(eval(memory_available/1024/1024)) as avg_available,
avg(eval((memory_total - memory_available)/1024/1024)) as avg_used
| stats
count,
avg(sql_time) as avg_exec_time,
max(sql_time) as max_exec_time,
min(sql_time) as min_exec_time
by app, object
| sort 20 - count, - avg_exec_time
| stats
sparkline(avg(mbo_count)),
avg(mbo_set_count) as set_count,
avg(mbo_count) as count
by mbo
| sort 20 - set_count
| timechart span=15m
first(users) as total_users,
avg(ui_sessions) as per_jvm_sessions
| stats
first(elapsed_time) as time
first(user_id) as user
first(app) as app
first(session_id) as session_id
by correlation_id
| stats
count(correlation_id) as count
sum(time) as time
first(user) as user
first(app) as app
by session_id
| sort 10 - time
Looking at the architecture, QRadar is composed of one Console and multiple managed hosts (in a distributed way). It is not clear (to me) how the management of this distributed system is done.
Source input uses in-memory queues for buffering. I understand events are dropped once queues are full (though that is not really desirable). What about fault tolerance? If a host is down, are all its events lost?
What is the typical (or expected) “turn-around” time between an event being sent to QRadar and it becoming searchable on the UI?
The Ariel schema seems to be static (“normalized fields”, as they are called). If users want to parse their own log source, they need to create a Log Source Extension (LSX). It seems only those normalized fields are usable in AQL.
How is free-text search’s Lucene index building fit into the pipeline?
It is only used for full-text search, alongside the Ariel query. Full-text search filters first and then feeds into the Ariel search.
The pipeline …
Rules are defined with pre-defined tests, specifically for the security domain. Some example pre-defined tests:
So …
It is specifically for events/flows (at least originally). Not sure how suitable it is for arbitrary data. It comes down to whether it supports a dynamic schema: say ingestion sees a new field, can this field be created in the Ariel schema?
Of course, there are only 2 “tables” (events and flows) so naturally AQL does not have join.
The point of a custom query language (like Splunk’s/Sumo Logic’s) over the Lucene/Solr query language is to be able to conveniently “stream” operations, manipulate field(s), aggregate, etc.
Master in console, slaves (read-only) in all managed hosts
node process manager
Runs on each appliance in a deployment
pre-populate by-minute, by-hour, and by-day counts of data for efficient graphing
owns and is responsible for persisting our asset and identity models into the DB
responsible for communicating with third party vulnerability scanners (through our plugins), scheduling scans and processing results
receiving flows from a number of sources, creating flow records, and periodically sending these records to ECS for further processing. sources could be:
Responsible for the scheduling and execution of the “report runner”. The report runner generates the reports that the executor requests.
Data comes from the Ariel payload and is fed to FTS. Essentially it creates one index file per time slot (default is 8h) and keeps N slots (configurable). A search can be parallelized across multiple slots (then aggregated). Indexing must be done and committed to the index file before a searcher instance can be created.
Hence the last, most recent slot is special: it uses Lucene’s NRT search, with the limitation that indexer and searcher must be in the same Java process (for performance).
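To make the per-slot idea concrete, a hedged Lucene sketch (the directory layout and slot names are assumptions): sealed slots are opened read-only and combined with a MultiReader, while the newest slot would instead contribute an NRT reader from its live IndexWriter.

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class SlotSearch {
    public static void main(String[] args) throws Exception {
        // Sealed 8h slots, each a self-contained index directory (names are illustrative).
        String[] slots = { "/data/fts/slot-2014122600", "/data/fts/slot-2014122608" };

        IndexReader[] readers = new IndexReader[slots.length];
        for (int i = 0; i < slots.length; i++) {
            readers[i] = DirectoryReader.open(FSDirectory.open(Paths.get(slots[i])));
        }

        // One logical search across all sealed slots; results are merged by Lucene.
        // The newest slot would instead contribute an NRT reader from its IndexWriter.
        IndexSearcher searcher = new IndexSearcher(new MultiReader(readers));
        TopDocs hits = searcher.search(new TermQuery(new Term("payload", "timeout")), 10);
        System.out.println("hits: " + hits.totalHits);

        for (IndexReader r : readers) r.close();
    }
}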
Solr utilizes an update log and hence can read uncommitted changes from the log. The benefit? This real-time get might be limited to get-by-unique-key only.
Why not use Solr, as it lives alongside Lucene in Apache?
Solr is pretty mature now, so why build from scratch? Any special features?
Spark only has exactly-once semantics, and in some cases falls back to at-least-once.
Fault tolerance requires an HDFS-backed data source (see). Putting data onto HDFS first adds some latency.
Spark Streaming is good for iterative batch processing, mostly machine learning. Scaling is not there yet (for Yahoo), and tuning can be difficult. Streaming latency is at least ~1s, and there is a single point of failure. All streaming inputs are replicated in memory.
That’s how Kafka and Storm entered.
At a glance, you would see a summary of the different kinds of exceptions that are currently occurring and the frequency of these exceptions, between desktop vs mobile or across different browsers (a kind of sub-grouping). It’s this instant and continually customized overview that is new and different. It allows you to get a much fuller view of your logs and to zero in more quickly on what’s important.
Use PatternDB to parse log messages and generate name-value pairs from them. These can then be forwarded to Loggly using the JSON output.
See https://www.loggly.com/blog/sending-json-format-logs-syslog-ng/
Should you send us a log format we don’t currently recognize, we’ll want you to tell us about it.
Provides a quick upload method (curl) to try first. Very simple and straightforward (though it’s even simpler to upload on the page, I suppose).
Auto detected apache access log as the type.
Does seem to be syslog-based (on the receiving end). In fact, all vendors do support a syslog endpoint.
Even the file monitoring instructions use the syslog configuration method. Even log4j is instructed to be configured to send to the local syslog daemon first.
From client side, you can also easily send logs via http/https, in plain text or json format.
Using tokens for security/identity. (https://logs-01.loggly.com/bulk/95179f81-503b-4d70-a2ed-728955261cc0/tag/file_upload)
Ingestion takes some time to show up on search page (~20s).
Source group: logical grouping on host/app/tag (AND). Used for like production/staging.
There doesn’t seem to be a way to re-parse existing data dynamically on the UI. That is, to modify the parsing method (like adding a new field as in Splunk), you’d have to re-send the data (after changing the collector).
JSON is automatically parsed and fields are automatically extracted. There are a few supported log formats, but other than that, you have to parse the data yourself and send it in as JSON.
In general, only has syslog and http/https endpoints.
One interesting thing: wikified strings are considered while indexing. That is, S3Stats.processStatEvents is indexed as “S 3 Stats process Stat Events”.
Search syntax similar/same to Lucene/Solr with minor extension. Like field:>40, exists:field:value, category range field:[xxx TO yyy]
Faceting is on the left side. It’s not very fast (the slowness might be my network). 2-level faceting is possible, and clicking creates a filter. Multiple filters are allowed, and you can easily remove an active filter.
I like the Google Finance-like timeline operation.
Save current ‘search’ easily by “star” button.
Alerts are based on saved searches and run periodically (the quickest is every minute). It can only alert if the saved search condition matched more than some number of times in some past period, like over 5 times in the past 5 minutes. In addition to email, you only have HTTP callback and PagerDuty choices.
Charting/trending/grouping is done on the UI and split into multiple steps: first select the chart type, then group accordingly. There’s also a grid view where you can freely compose your own table view.
Currently only timeline/pie/bar/single-value. Charting should really include grid/table form.
The ‘dimension’ in charting is fixed. That is, you cannot combine multiple fields. For example, unique visitors in Apache access logs require a combination of IP/Date/Agent. This is more fundamentally a limitation of the faceting, which only allows faceting on an existing field, with no composition.
Also, there’s no control over the time unit. A saved widget uses the current time range as-is (which is not easy to use, btw). Say I want to have a trend line on a per-day unit.
I like the tabbed interface, able to preserve a diagnosis session (most likely the time range filter).
When charting/trending, you can’t do roll-ups, like group-by and then rolling up some other columns (of course, it can only show the single field used as the group-by). For example, top hosts also showing total bytes requested for each host.
Requires custom collector. Once collector is installed, it is managed via web UI. Sources could have different static tags for search purpose. Filters could also be defined on sources (ex/include, hash, mask).
Currently sources could be :
There’s another “hosted” option which provides HTTP/S and S3 endpoints.
The collector takes jobs from the server, seemingly scheduled every 2 minutes. Other than that, ingestion speed is fine compared to Loggly (can’t really tell how slow or fast it is due to the lag).
The parsing is done during search time, just like Splunk.
It has pipe in search just like Splunk.
There’s a summarize operator which clusters messages based on pattern similarities. Ex.
error or fail* | summarize| parse "message: *" as msg | summarize field=msg
more …
count _size | top 10 _size by _count
Parsing apache access log:
* | parse using public/apache/access | where status_code matches "4*"
Actually more mature than Loggly.
I like the ‘Library’ overlay that shows common/stock searches categorized by application.
When I tried to create a widget out of the unique visitors result, it failed due to an unsupported operator.
Dashboard: for some reason, can’t get it to show data successfully.
Also via saved search. Even though the UI says it could be real-time, I think it is still running periodically.
Alert actions could be email, script, or save to index.
Basically syslogd. It also provides a small remote_syslog daemon.
Least feature-rich of those tried so far. Essentially positioned as a centralized “tail” service. Minimal search functions (basic boolean operators and full text). Also seems to only allow syslog messages.
Allows log filters to remove logs you never want during ingestion.
Scheduled saved search. Support many web-hooks as notification.
The whole dashboard is row-based, with each row containing multiple “panels”. You can specify the height of each row in pixels, and the width of panels in x/12 (like Bootstrap). There are about 17 types of panels to choose from, more like pre-defined panels for use, e.g. timepicker, query, facet, filtering, etc. Think of widgets.
Each panel defines its own config section.
Some panels apply globally to the whole page’s panel (like timepicker and query).
“Display help message when Inspect” allows html, markdown, and text modes, with some content specified in panel config. For example, bettermap allows showing active query when inspect.
One pitfall is that some faceting only works on numeric fields, which is kind of inconvenient.
New Relic Insights is powered by a highly distributed cloud-hosted event database with an innovative architecture that does not require indexing.
Data is distributed across cluster nodes so each query can leverage the full power of the cluster to return billions of metrics in seconds.
The New Relic Insights database does not require indexing. This eliminates the need to configure the database in advance based on the types of queries expected. Query speed is fast regardless of what attributes you query.
New Relic Insights uses the SQL-flavored query language - NRQL (New Relic Query Language) for querying the events database.
The amount of data we collect every day is staggering. Initially all data is captured at full resolution for each metric. Over time we reperiodize the data, going from minute-by-minute to hourly and then finally to daily averages. For our professional accounts, we store daily metric data indefinitely, so customers can see how they’ve improved over the long haul.
Our data strategy is optimized for reading, since our core application is constantly needing to access metric data by time series. It’s easier to pay a penalty on write to keep the data optimized for faster reads, to ensure our customers can quickly access their performance data any time of the day.
Creates a database table per account per hour to hold metric data. This table strategy is optimized for reads vs. writes
Having so many tables with this amount of data in them makes schema migrations impossible. Instead, “template” tables are used from which new timeslice tables are created. New tables use the new definition while old tables are eventually purged from the system. The application code needs to be aware that multiple table definitions may be active at one time.
Data collection handled by 9 sharded MySQL servers.
The “collector” service digests app metrics and persists them in the right MySQL shard.
These back-end metrics are persisted in the customer’s New Relic agent for up to one minute, where they are then sent back to the New Relic data collection service
Real User Monitoring javascript snippet sends front-end performance data to the “beacon” service for every single page view.
An In-Depth Look at the New Relic Platform
Plugins are made up of two parts:
Alerting: AppDynamics computes your response time thresholds by itself and might take some time to learn your system. New Relic relies on custom thresholds defined by you for its Apdex index.
Receiving, processing (normalize & coalesce), running it against Rules and then storing the data to disk,
There are two separate paths for events and flows within the event collector. It is responsible for normalizing and parsing the received events.
There are multiple in-memory “input” queues (for different types of source) that house incoming events. Queues are sized based on “event counts” rather than memory usage, and as such are limited to set numbers to ensure we don’t consume all available memory. Once those queues fill, any further incoming events are dropped.
The Event Processor houses the Custom Rule Engine (CRE) and the storage filters for Flows and Events.
It is responsible for creating offenses and tabulating all the information associated with those offenses.
Query server is in Java, queuing system, tasks/subtasks based.
Append-only, immutable data files, organized by time interval.
What’s good about Storm?
ElasticSearch index management
Use the right tech for the job. The main New Relic web application has always been a Rails app. The data collection tier was originally written in Ruby, but was eventually ported over to Java. The primary driver for this change was performance. This tier currently supports over 180k requests per minute and responds in around 2.5 milliseconds with plenty of headroom to go.
Extracting properties: event id, source ip, source port, dest ip, dest port, protocol, username, pre-nat ip/port, post-nat ip/port, ipv6 source/dest ip, source/dest mac, machine name.
If events repeat, they coalesce based on a multipart key, which uses source ip, dest ip, dest port, username & event id. Once 4 events come in that match the same key, the 4th event is coalesced with any additional events that match that same key for the next 10 seconds. Once the 10 seconds go by, the counter is reset and coalescing for that “key” restarts.
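A rough sketch of that coalescing rule as I read it (the key record and the window handling are my interpretation, not QRadar code):

import java.util.HashMap;
import java.util.Map;

public class Coalescer {
    // Multipart coalescing key as described: src ip, dst ip, dst port, username, event id.
    record Key(String srcIp, String dstIp, int dstPort, String user, String eventId) {}

    private static class State { int count; long windowStart; int coalesced; }

    private final Map<Key, State> states = new HashMap<>();

    // Returns true if the event should be coalesced (counted onto the 4th event's
    // record rather than stored individually).
    public boolean offer(Key key, long nowMillis) {
        State s = states.computeIfAbsent(key, k -> new State());
        if (s.count >= 4 && nowMillis - s.windowStart > 10_000) {
            s.count = 0;            // 10s window expired: reset and start over
            s.coalesced = 0;
        }
        s.count++;
        if (s.count == 4) {
            s.windowStart = nowMillis;   // the 4th match opens a 10-second coalescing window
        }
        if (s.count > 4 && nowMillis - s.windowStart <= 10_000) {
            s.coalesced++;               // folded into the 4th event's stored record
            return true;
        }
        return false;
    }
}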
autodetect log sources
Forward processed, parsed events to another QRadar deployment. This is typically used in “Disaster Recovery” (DR) deployments, where customers want to have a second console/installation that has a backup copy of production data.
configured in the Admin tab under the item “Routing Rules”. Options in the routing rules allow a user to filter/drop/modify events that are going through the pipeline.
This happens AFTER licensing, occurring between the EC and EP processes, so events that say, are marked to be dropped, STILL count against your EPS license.
The options in the “routing options” include:
The CRE runs on each event processor and monitors data as it comes through the pipeline. There are multiple types of rules: those that check properties of single events, and functions which can track multiple events over a period of time or some other “counter”-based parameter.
The CRE is an asynchronous filter which performs the processing of events and flows against all the rules currently active in the system. Some of the activities include:
The event streaming code bypasses normal ariel searching and uses a connection from the “ecs” service directly to tomcat.
This is the heart of our correlation abilities. The UI component allows a rich set of real-language based tests such as “and when the Source IP is one of ….” or “and when the payload matches the following regular expression”. Customers configure these rules to target specific incidents that they may be interested in. The output of the CRE is typically an offense, an e-mail, a system notification, an entry into a reference set, to forward the event on somewhere else, etc…
Rules are defined with preset tests, specifically for the security domain. Rules have multiple variables for users to fill in. A rule can have multiple tests, combined with or/and relationships.
One kind of test could be when a specific set of “rules” has matched …
Rule action can be:
Rule response can be:
There can also be a limiter, e.g. no more than N per X (time) per unit (rule/source/etc.).
Global Rule
For rules that have counters or functions and are marked as “test globally”: if they have partial matches, those “partial matches” are sent up to the EP on the Console for counting.
For example, if your rule says to look for Windows events with 5 login failures from the same ip and same username, and it’s marked test globally, each Windows login failure is sent to the EP on the Console from the EP that sees it. If there were 3 failures on one EP and 2 on another, the central EP will get notification of all 5 and fire the rule.
When you have a rule that’s marked to test globally, it only really triggers if it’s a rule with a function. A rule that only tests a single event, even for multiple properties, is automatically NOT tested globally, to avoid unnecessary load on the event pipeline.
For example, if you had a rule looking for an event with user = x and source ip = y and event type “login failed”, it will automatically NOT run a global test, because it’s only testing 3 properties of a single event. However, if you have a rule marked test globally with a function looking for a count of events over a period of time, each match is a partial match. Partial matches across multiple event processors are why we need the concept of “global rules”.
What happens then is that the event processor with the partial match will send that event “reference” to the central event processor on the Console. Then, if another EP in another location also finds an event that matches, it also sends it to the Console for global testing, and we can then do correlation across multiple event processors.
Global tests are currently turned off by default, since the load on the central EP can be substantial if a lot of rules are tested globally.
We must store duplicates in GCC mode for magistrate to be able to properly handle pending offenses and partial matches of rules along the way. Essentially what happens is:
Searches WILL show both events because they are both there.
Magistrate can now create offenses and run pending offense queries to backfill offenses that had lead-up global events.