True Metrics
In yesterday's post, we discussed getting data for metrics from logs and database states, because that information can be pulled without affecting the operations of the system. However, there are a few drawbacks to this approach.
First, most logs make their way to a time series database, also called a log index, and the log files themselves are hardly ever used directly to compute the data for the metrics.
Second, this flow of logs may be interrupted, or events may simply go missing. In some cases, it goes undetected that the logs are not making their way to the log index.
Third, within the time series database there is no ownership of the logs, and consequently events can be deleted.
Fourth, the same goes for duplication of events, which tampers with the data for the metrics.
Fifth, the statistics computed from a time series database depend on the criteria used to perform the search, also called the query. This query may not always be accurate and correct for the metric being reported.
These vulnerabilities dilute the effectiveness of a metric and therefore lead not only to inaccurate reports but possibly to misleading charts.
Let's compare this with the data available from the database states.
The states of the services are internal and tightly controlled by the service.
They are transactional in nature and can therefore be counted as a true representation of the corresponding events.
They are up to date with the most recent activity of the service and hence provide information up to the nearest point in time.
Queries against the database are not only accurate but are likely the same ones the service itself runs when it exports metrics directly, as sketched below.
The data is as authoritative as possible because the service relies on it.
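As a minimal sketch of that last point, assuming a hypothetical orders table and a SQL Server connection string, a service might compute a metric straight from its own transactional state like this:
using System;
using Microsoft.Data.SqlClient;

class OrderMetrics
{
    // Counts the orders created in the last hour directly from the
    // authoritative table, rather than from logs. The table and column
    // names here are placeholders for illustration.
    public static int OrdersLastHour(string connectionString)
    {
        using var conn = new SqlConnection(connectionString);
        conn.Open();
        using var cmd = new SqlCommand(
            "SELECT COUNT(*) FROM orders " +
            "WHERE created_at >= DATEADD(hour, -1, GETUTCDATE())", conn);
        return (int)cmd.ExecuteScalar();
    }
}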
However, the drawback is that not all information is available in the database, particularly information about error conditions. In such cases, the log files are the better option.
Also, the metrics data may be pulled frequently, and this might interfere with the operations of the database or the service, depending on which one is providing the information.
Therefore, a solution is to keep the log file and the time series database in sync by controlling the processes that move the data between them. For example, a syslog drain to the time series database that is set up once when the service is deployed guarantees that the log events will flow to the index. Users of the index may be given read-only permissions.
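To illustrate, here is a minimal sketch of such a forwarder, assuming the index accepts syslog-style messages over UDP on port 514; the host, port and facility are placeholder choices:
using System.Net.Sockets;
using System.Text;

class SyslogDrain
{
    // Sends one log line to the index as a syslog-style UDP datagram.
    // <134> encodes facility local0 (16 * 8) plus severity informational (6).
    public static void Send(string host, int port, string message)
    {
        using var udp = new UdpClient();
        byte[] payload = Encoding.UTF8.GetBytes($"<134>{message}");
        udp.Send(payload, payload.Length, host, port);
    }
}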
Lastly, log indexes have a way of exporting data suitable for use with external charting tools. Some log indexes also provide rich out-of-the-box charting. For example, Splunk lets us find the number of calls made against each of the HTTP status codes with a nifty search command as follows:
search source=sourcetype earliest=-7d@d | timechart count by status
Since the 200 status is going to be the predominant value, it can be drawn as an overlay over the chart so that the other values are clearer, distinctly colored, and easier to read.
If the status codes do not appear as a separate field, which is a common case, we can use the cluster command with something like
index=_internal source=sourcetype | cluster showcount=t | table _time, cluster_count, _raw | timechart count by cluster_count
or we can extract the field explicitly with rex "HTTP/1.1\" (?<status>\d+)" and then chart by status.
#puzzle question
You have a 4x4 matrix with the first row occupied by the numbers 17, 99, 59 and 71.
Find an arrangement of the matrix such that the sum in any direction is equal.
Answer: Many answers would be possible if the occupied cells were scattered all over the matrix. With the first row fixed, however, we can fill in the remaining cells using only these four numbers, such that no number repeats row-wise, column-wise, or along either diagonal. This way all vertical, horizontal and diagonal lines add up to the same sum, 17 + 99 + 59 + 71 = 246.
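For instance, one such arrangement is:
17 99 59 71
59 71 17 99
71 59 99 17
99 17 71 59
Every row, column and diagonal here contains each of the four numbers exactly once, so each sums to 246.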
#codingexercise
Given a sorted array of numbers and values K and X, find the K numbers nearest to the value X.
void printKNearest(List<int> A, int x, int k)
{
    int n = A.Count;
    // Binary search for the crossover point: the index of the
    // largest element that is less than or equal to x.
    int left = GetTransitionPointByBinarySearch(A, 0, n - 1, x);
    int right = left + 1;
    int count = 0;
    if (left >= 0 && A[left] == x) left--; // exclude x itself from the results
    // Expand outward from the crossover point, taking the closer neighbor each time.
    while (left >= 0 && right < n && count < k)
    {
        if (x - A[left] < A[right] - x) {
            Console.Write(A[left] + " ");
            left--;
        } else {
            Console.Write(A[right] + " ");
            right++;
        }
        count++;
    }
    // If one side is exhausted, take the remaining numbers from the other side.
    while (count < k && left >= 0) {
        Console.Write(A[left] + " ");
        left--;
        count++;
    }
    while (count < k && right <= n - 1) {
        Console.Write(A[right] + " ");
        right++;
        count++;
    }
}
int GetTransitionPointByBinarySearch(List<int> A, int lo, int hi, int x)
{
    // Returns the largest index i in [lo, hi] with A[i] <= x, or lo - 1 if none.
    if (A[lo] > x) return lo - 1;
    if (A[hi] <= x) return hi;
    int mid = (lo + hi) / 2;
    if (A[mid] <= x && A[mid + 1] > x) return mid;
    if (A[mid] <= x) return GetTransitionPointByBinarySearch(A, mid + 1, hi, x);
    return GetTransitionPointByBinarySearch(A, lo, mid - 1, x);
}
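For example, with A = { 12, 16, 22, 30, 35, 39, 42, 45, 48, 50, 53, 55, 56 }, x = 35 and k = 4, the method prints 39 30 42 45: it excludes 35 itself and then expands outward, taking whichever neighbor is closer at each step.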
One of the key observations is that these metrics could be pushed directly from the API of the services.
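As a rough sketch of what that could look like, the following minimal ASP.NET Core service counts its own requests and serves the running total from a /metrics route; the route and counter name are illustrative choices, not a fixed convention:
var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();
long requestCount = 0;
// Count every request the service handles.
app.Use(async (context, next) =>
{
    System.Threading.Interlocked.Increment(ref requestCount);
    await next();
});
// Serve the metric directly from the service itself.
app.MapGet("/metrics", () =>
    $"http_requests_total {System.Threading.Interlocked.Read(ref requestCount)}");
app.Run();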