Monitoring Metrics

Nov 14, 2018

The Metrics page in the Drill Web UI (http(s)://<drillbit-ip-address>:8047/metrics) lists JVM, operating system, and certain Drill-specific metrics. You can use these metrics to debug the state of the cluster. The names of Drill-specific metrics are prefixed with drill, for example drill.fragments.running.
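As a rough illustration of the naming convention, the sketch below picks the Drill-specific entries out of a metrics payload. It assumes the payload is a JSON object keyed by category (gauges, counters, and so on) with metric names as the inner keys. That layout, the sample values, and the drill_specific helper are assumptions made for the example, not something this page specifies.

```python
import json

# Hypothetical sample payload; real values come from your Drillbit.
SAMPLE = """
{
  "gauges": {
    "drill.fragments.running": {"value": 3},
    "heap.used": {"value": 512000000}
  },
  "counters": {
    "drill.queries.running": {"count": 1}
  }
}
"""


def drill_specific(payload):
    """Return {category: {name: entry}} for metric names starting with 'drill.'."""
    result = {}
    for category, metrics in payload.items():
        if not isinstance(metrics, dict):
            continue  # skip any non-category field, such as a version string
        selected = {
            name: entry
            for name, entry in metrics.items()
            if name.startswith("drill.")
        }
        if selected:
            result[category] = selected
    return result


metrics = drill_specific(json.loads(SAMPLE))
for category, entries in metrics.items():
    for name in sorted(entries):
        print(category, name, entries[name])
```

The JVM metric heap.used is dropped because its name lacks the drill prefix.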

Drill uses JMX (Java Management Extensions) to expose metrics at runtime. JMX provides the architecture to dynamically manage and monitor applications. JMX collects Drill system-level metrics that you can view in the Metrics tab in the Drill Web UI or in a remote JMX monitoring tool, such as JConsole or VisualVM with the MBeans plugin.

Metrics collected by JMX are divided into the following categories on the Metrics page in the Drill Web UI:

  • Gauges
    A gauge is an instantaneous measure of a value. See Gauges.
  • Counters
    A counter is a snapshot of a count at a particular point in time (in effect, a gauge for an AtomicLong instance). See Counters.
  • Histograms
    A histogram measures the statistical distribution of values in a stream of data. See Histograms.
  • Meters
    A meter measures the rate of events over time, for example requests per second. Drill currently does not use meters to report system-level metrics.
  • Timers
    A timer measures the rate at which a particular piece of code is called and the distribution of its duration. See Timers.
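To make the differences between these categories concrete, here is a toy sketch of a gauge, a counter, a histogram, and a timer. The classes are illustrative only and are not Drill's implementation. A meter is omitted, and the timer records call durations without computing a call rate.

```python
import time
from contextlib import contextmanager


class Gauge:
    """Reads a value on demand; nothing is stored between reads."""

    def __init__(self, read):
        self._read = read

    def value(self):
        return self._read()


class Counter:
    """A running count that the application increments or decrements."""

    def __init__(self):
        self.count = 0

    def inc(self, n=1):
        self.count += n

    def dec(self, n=1):
        self.count -= n


class Histogram:
    """Records observed values and summarizes their distribution."""

    def __init__(self):
        self.values = []

    def update(self, value):
        self.values.append(value)

    def snapshot(self):
        if not self.values:
            return {"count": 0}
        return {
            "count": len(self.values),
            "min": min(self.values),
            "max": max(self.values),
            "mean": sum(self.values) / len(self.values),
        }


class Timer:
    """Keeps a histogram of how long each timed call took, in seconds."""

    def __init__(self):
        self.durations = Histogram()

    @contextmanager
    def time(self):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.durations.update(time.perf_counter() - start)


# Hypothetical usage: the gauge always reflects the list's current length.
running = []
fragments_running = Gauge(lambda: len(running))
running.extend(["fragment-1", "fragment-2"])

queries = Counter()
queries.inc()

sizes = Histogram()
for size in (10, 20, 30):
    sizes.update(size)

timer = Timer()
with timer.time():
    sum(range(1000))

print(fragments_running.value(), queries.count, sizes.snapshot())
```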

Remote Monitoring

You can enable the remote JMX Java feature to monitor a specific JVM from a remote location. You can enable remote JMX with or without authentication. See the Java documentation.

In $DRILL_HOME/conf/drill-env.sh, use the DRILLBIT_JAVA_OPTS variable to pass the relevant parameters. For example, to add remote monitoring on port 8048 without authentication:

   export DRILLBIT_JAVA_OPTS="$DRILLBIT_JAVA_OPTS -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port=8048"
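If you generate drill-env.sh from a script, a small helper along the following lines could assemble the same JVM system properties for a given port. The function name jmx_remote_opts and its parameters are hypothetical; only the property names are taken from the example above.

```python
def jmx_remote_opts(port, authenticate=False, ssl=False):
    """Build the remote JMX system properties as a single string."""
    if not 1 <= port <= 65535:
        raise ValueError(f"invalid TCP port: {port}")
    flags = [
        "-Dcom.sun.management.jmxremote",
        # Java expects lowercase true/false for these properties.
        f"-Dcom.sun.management.jmxremote.authenticate={str(authenticate).lower()}",
        f"-Dcom.sun.management.jmxremote.ssl={str(ssl).lower()}",
        f"-Dcom.sun.management.jmxremote.port={port}",
    ]
    return " ".join(flags)


opts = jmx_remote_opts(8048)
print(f'export DRILLBIT_JAVA_OPTS="$DRILLBIT_JAVA_OPTS {opts}"')
```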

Disabling Drill Metrics

JMX metric collection is enabled by default. You can disable it if needed.

In $DRILL_HOME/conf/drill-env.sh, set the drill.metrics.jmx.enabled option to false through the DRILLBIT_JAVA_OPTS variable. Add the variable in drill-env.sh if it does not exist:

   export DRILLBIT_JAVA_OPTS="$DRILLBIT_JAVA_OPTS -Ddrill.metrics.jmx.enabled=false"   

Gauges

The following table lists the Drill-specific metrics in the Gauges section of the Metrics page:

Metric Description
blocked.count The number of threads that are blocked because they are waiting on a monitor lock. This metric is useful for debugging Drill issues.
drill.fragments.running The number of query fragments currently running in the drillbit.
drill.allocator.root.used The amount of memory (in bytes) used by the internal memory allocator.
drill.allocator.root.peak The peak amount of memory (in bytes) used by the internal memory allocator.
drill.allocator.rpc.bit.control.peak The maximum number of bytes used across all outgoing and incoming control connections for this Drillbit at the current time.
drill.allocator.rpc.bit.control.used The total number of bytes currently used across all outgoing and incoming control connections for this Drillbit.
drill.allocator.rpc.bit.data.peak The maximum amount of memory used across all outgoing and incoming data connections for this Drillbit at the current time.
drill.allocator.rpc.bit.data.used The total amount of memory used across all outgoing and incoming data connections for this Drillbit.
drill.allocator.rpc.bit.user.peak The maximum amount of memory used across all incoming Drill client connections to this Drillbit at the current time.
drill.allocator.rpc.bit.user.used The total amount of memory used across all incoming Drill client connections to this Drillbit.
drill.allocator.huge.size Total size in bytes of huge (greater than 16MB) direct buffers currently allocated.
drill.allocator.huge.count The number of allocations of direct buffers larger than 16MB. Each of these allocations comes from the OS rather than from Netty's buffer pool, which incurs overhead.
drill.allocator.normal.count The number of allocations of direct buffers of 16MB or less. Each of these allocations comes from Netty's buffer pool. This counter is updated only in debug environments, when asserts are enabled, to avoid overhead on each allocation during normal execution.
drill.allocator.normal.size Total size in bytes of normal (16MB or less) direct buffers currently allocated. This counter is updated only in debug environments, when asserts are enabled, to avoid overhead on each allocation during normal execution.
count The number of live threads, including daemon and non-daemon threads.
heap.used The amount of heap memory (in bytes) used by the JVM.
non-heap.used The amount of non-heap memory (in bytes) used by the JVM.
fd.usage The ratio of used file descriptors to total file descriptors on *nix systems.
direct.used The amount of direct memory (in bytes) used by the JVM. This metric is useful for debugging Drill issues.
runnable.count The number of threads executing an action in the JVM. This metric is useful for debugging Drill issues.
waiting.count The number of threads waiting to execute. Typically, threads waiting on other threads to perform an action. This metric is useful for debugging Drill issues.
load.avg The "recent CPU usage" for the Drillbit process. This value is a double in the [0.0,1.0] interval. A value of 0.0 means that none of the CPUs were running threads from the Drillbit process during the recent period observed, while a value of 1.0 means that all CPUs were actively running threads from the Drillbit process 100% of the time during that period. Threads from the Drillbit process include the application threads as well as the JVM internal threads. All values between 0.0 and 1.0 are possible, depending on the activity in the Drillbit process and in the whole system. If the recent CPU usage is not available, the value is negative. See getProcessCpuLoad().
uptime Total uptime of the Drillbit JVM in milliseconds. See getUptime().
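The raw gauge values are easier to read after a little conversion. The sketch below assumes you already have the gauges as a flat mapping of metric name to value. That mapping, the sample numbers, and the summarize_gauges helper are made up for illustration.

```python
from datetime import timedelta


def summarize_gauges(gauges):
    """Turn a few raw gauge values into easier-to-read figures."""
    summary = {}

    uptime_ms = gauges.get("uptime")
    if uptime_ms is not None:
        # uptime is reported in milliseconds.
        summary["uptime"] = str(timedelta(milliseconds=uptime_ms))

    load = gauges.get("load.avg")
    if load is not None:
        # A negative value means the recent CPU usage is not available.
        summary["cpu_percent"] = None if load < 0 else round(load * 100, 1)

    used = gauges.get("drill.allocator.root.used")
    peak = gauges.get("drill.allocator.root.peak")
    if used is not None and peak:
        # Current allocator usage as a fraction of the observed peak.
        summary["allocator_used_vs_peak"] = round(used / peak, 3)

    return summary


# Hypothetical sample values.
sample = {
    "uptime": 93_784_000,
    "load.avg": 0.25,
    "drill.allocator.root.used": 1_000_000,
    "drill.allocator.root.peak": 4_000_000,
}
summary = summarize_gauges(sample)
print(summary)
```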

Counters

The following table lists the Drill-specific metrics in the Counters section of the Metrics page:

Metric Description
drill.connections.rpc.control.encrypted The total number of encrypted incoming and outgoing control connections to and from this Drillbit. This includes both the control client and server connections.
drill.connections.rpc.control.unencrypted The total number of unencrypted incoming and outgoing control connections to and from this Drillbit. This includes both the control client and server connections.
drill.connections.rpc.data.encrypted The total number of encrypted incoming and outgoing data connections to and from this Drillbit. This includes both the data client and data server connections.
drill.connections.rpc.data.unencrypted The total number of unencrypted incoming and outgoing data connections to and from this Drillbit. This includes both the data client and data server connections.
drill.connections.rpc.user.encrypted The total number of encrypted connections from all the Drill clients to this Drillbit.
drill.connections.rpc.user.unencrypted The total number of unencrypted connections from all the Drill clients to this Drillbit.
drill.queries.canceled The number of canceled queries for which this Drillbit was the Foreman.
drill.queries.completed The number of queries completed, canceled, or failed for which this Drillbit was the Foreman.
drill.queries.enqueued The number of waiting queries across all of the configured queues for which this Drillbit is the Foreman.
drill.queries.failed The number of failed queries for which this Drillbit was the Foreman.
drill.queries.planning The number of queries that are in the planning stage for which the Drillbit is the Foreman.
drill.queries.running The number of queries running for which this Drillbit is the Foreman.
drill.queries.succeeded The number of successful queries for which this Drillbit was the Foreman.
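The query counters can be combined into a quick health summary. This sketch rests on two assumptions. First, the planning, enqueued, and running states do not overlap. Second, drill.queries.completed covers every finished query, as its description suggests. The flat name-to-count mapping and the query_summary helper are hypothetical.

```python
def query_summary(counters):
    """Derive in-flight and failure figures from the drill.queries.* counters."""

    def get(name):
        return counters.get(f"drill.queries.{name}", 0)

    # Assumption: a query is in exactly one of these states at a time.
    in_flight = get("planning") + get("enqueued") + get("running")

    completed = get("completed")
    # Avoid dividing by zero before any query has finished.
    failure_ratio = get("failed") / completed if completed else None

    return {"in_flight": in_flight, "failure_ratio": failure_ratio}


# Hypothetical sample counts.
sample = {
    "drill.queries.planning": 1,
    "drill.queries.enqueued": 2,
    "drill.queries.running": 3,
    "drill.queries.completed": 20,
    "drill.queries.failed": 2,
}
result = query_summary(sample)
print(result)
```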

Histograms

The following table lists the Drill-specific metrics in the Histograms section of the Metrics page:

Reporting Class Description
drill.allocator.huge.hist Displays the distribution of huge buffer allocations up to the current time. For example, Count specifies the number of huge buffer allocations completed so far. Max and Min specify the maximum and minimum size in bytes of the huge buffers allocated. Mean and the percentiles show the distribution of huge buffer allocation sizes in bytes.
drill.allocator.normal.hist Displays the distribution of normal-size buffer allocations up to the current time. For example, Count specifies the number of normal buffer allocations completed so far. Max and Min specify the maximum and minimum size in bytes of the normal buffers allocated. Mean and the percentiles show the distribution of normal buffer allocation sizes in bytes.
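To show what the fields of these two histograms describe, the sketch below splits a list of allocation sizes at the 16MB boundary and summarizes each group. It computes exact statistics over a small made-up list, which may differ from how the values on the Metrics page are sampled. Treating 16MB as 16 * 1024 * 1024 bytes is also an assumption.

```python
import statistics

# Assumed byte value of the 16MB boundary between normal and huge buffers.
HUGE_THRESHOLD = 16 * 1024 * 1024


def summarize(sizes):
    """Return count, min, max, mean, and median for a list of sizes in bytes."""
    if not sizes:
        return {"count": 0}
    return {
        "count": len(sizes),
        "min": min(sizes),
        "max": max(sizes),
        "mean": statistics.mean(sizes),
        "median": statistics.median(sizes),
    }


def split_by_size(sizes):
    """Summarize normal (16MB or less) and huge (more than 16MB) allocations."""
    normal = [s for s in sizes if s <= HUGE_THRESHOLD]
    huge = [s for s in sizes if s > HUGE_THRESHOLD]
    return summarize(normal), summarize(huge)


# Hypothetical allocation sizes in bytes.
allocations = [4096, 8192, HUGE_THRESHOLD, HUGE_THRESHOLD + 1, 64 * 1024 * 1024]
normal_hist, huge_hist = split_by_size(allocations)
print("normal:", normal_hist)
print("huge:", huge_hist)
```

An allocation of exactly 16MB counts as normal, matching the "16MB or less" wording in the Gauges table.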

Meters

Not available.

Timers

Reporting Class Description
org.apache.drill.exec.cache.VectorAccessibleSerializable.writerTime Measures the distribution of the time taken to serialize a record batch to the output stream. Mainly used to measure the time taken to spill a record batch.
org.apache.drill.exec.store.schedule.BlockMapBuilder.blockMapBuilderTimer Measures the distribution of the time taken to build a mapping of block locations for a given file byte range. Mainly used during the planning phase to determine a set of endpoints where all the data is located.