11. Box Plots

11.1. Description

Box plots, or box-and-whisker plots, summarize a distribution of values using Tukey’s 5-number summary (Hoaglin et al., 1983). The dark line in the middle of each box (sometimes called the ‘waist’) is the median of the data: half of the data values are greater than the median, and half are lower. The box itself (i.e. the central rectangle) spans the first quartile to the third quartile, so its height is the interquartile range (IQR). Each whisker extends from the edge of the box to the most extreme data value that lies within 1.5 times the IQR of that edge; if the minimum or maximum of the data is within that distance, the whisker ends there. The points represent outliers: extreme values that do not fall inside the whiskers, i.e. data points more than 1.5 times the IQR beyond the first or third quartile.
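The quantities described above can be sketched in a few lines of Python. This is only an illustration, not METviewer's own code: the function name `box_stats` and the sample RMSE values are made up, and the quartiles are computed with the standard library's "inclusive" method, which may differ slightly from the hinges that R uses when METviewer draws the plot.

```python
from statistics import median, quantiles

def box_stats(values, coef=1.5):
    """Median, quartiles, whisker ends and outliers for one box (illustrative)."""
    data = sorted(values)
    # Assumption: "inclusive" quartiles; R's boxplot hinges can differ slightly.
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - coef * iqr
    high_fence = q3 + coef * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "median": median(data),
        "q1": q1,
        "q3": q3,
        # Whiskers end at the most extreme data values inside the fences.
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        # Anything beyond the fences is drawn as an individual point.
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }

# Hypothetical RMSE values for a single lead time.
rmse = [1.8, 2.0, 2.1, 2.2, 2.3, 2.4, 2.6, 4.9]
stats = box_stats(rmse)
print(stats)
```

With these sample values, the largest value lies more than 1.5 times the IQR above the third quartile, so it is reported as an outlier and the upper whisker stops at the next-largest value.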

Box plots work best for comparing a continuous quantity (e.g. some verification statistic) across a set of discrete categories. The example below shows RMSE (root-mean-squared error; a continuous measure of forecast quality) for each of several distinct lead times. Other examples could be probability of detection (POD) or Gilbert Skill Score (GSS) across a set of models for a specific precipitation threshold. In those cases, each forecast and observation is binary, but the verification statistic is continuous.

11.2. How-To

Selection of options to produce the plot proceeds approximately counter-clockwise around the METviewer window. The steps to create a box plot are:

  1. Select the desired database from the “Select databases” pulldown menu at the top margin of the METviewer window.

  2. There are a number of tabs just under the database pulldown menu. Select the ‘Box’ tab.

  3. Select the type of MET statistics that will be used to create the box plot. Click on the “Plot Data” pulldown menu, which is located under the tabs. The list contains “Stat”, “MODE”, and “MODE-TD”. For details about these types of output statistics in MET, please see the most recent version of the MET User’s Guide.

  4. Select the desired variable to calculate statistics for in the “Y1 Axis Variables” tab. The first pulldown menu in the “Y1 Dependent (Forecast) Variables” section lists the variables available in the selected dataset.

  5. Select the desired statistic to calculate in the second pulldown menu which is to the right of the variable menu. This lists the available attribute statistics for the selected dataset. Multiple statistics can be selected and they will each be plotted as separate boxes on the plot.

  6. Select the Y1 Series Variable from the first pulldown menu in that section. There are many options; “MODEL” is used in the included example. The second pulldown menu, to the right of the first, lists the values available for the chosen series variable, for example, the different models.

  7. It usually does not make sense to mix statistics for different groups. The desired group to calculate statistics over can be specified using the “Fixed Values” section. In the example below, a single domain (category: “VX_MASK”, value: “CONUS”) and a single level (category: “FCST_LEV”, value: “Z2”) are chosen. If multiple domains or thresholds were chosen, the statistics would be a summary of all of those cases together, which may not always be desired.

  8. Select the x-axis value in the “Independent Variable” dropdown menu. For a box plot, this is often a date, lead time, or threshold. In the example in the next section, the Y1 dependent variable “RMSE” is plotted for the ensemble member selected in “Y1 Series Variable” and is plotted over forecast lead time.

  9. Select the type of statistics summary by selecting either the “Summary” or the “Aggregation Statistics” button in the “Statistics” section. Aggregation statistics are available only for certain varieties of statistics. The summary statistic is chosen from the leftmost dropdown menu in the “Statistics” section. By default, the median value of all statistics will be plotted; the mean or sum may be selected from the dropdown menu instead. Choosing this option will cause a single statistic to be calculated from the individual database lines.

  10. The “Plot Configurations” options include settings specific to box plots, such as whether or not to show outliers, points, notches, and more. The box width can also be altered here.

  11. Enough information has now been entered to produce a graph. To do this, click the “Generate Plot” button at the top of the METviewer window (this is in red text). Typically, if a plot is not produced, it is because the selected database does not contain the correct type of data. It is also imperative to check the data used for the plot by selecting the “R data” tab on the right-hand side, above the plot area. This tab lists the data from the database that are being used to calculate the statistics, and it should be checked to avoid the accidental accumulation of inappropriate database lines. For example, it does not make sense to accumulate statistics over different domains, thresholds, models, etc.

There are many other options for plots, but these are the basics.
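The check recommended in the last step, and the "Summary" median from the “Statistics” section, can be sketched offline in Python. This is a rough illustration under stated assumptions: the rows below are invented stand-ins for the contents of the “R data” tab, the column name `stat_value` is hypothetical, and the helper names `mixed_fields` and `summarize_by` are not part of METviewer.

```python
from collections import defaultdict
from statistics import median

# Hypothetical rows; the real "R data" tab may use different column names.
rows = [
    {"model": "HRRR_sppmp_mem1_hrconus", "vx_mask": "CONUS", "fcst_lev": "Z2",
     "fcst_lead": 30000, "stat_value": 1.9},
    {"model": "HRRR_sppmp_mem1_hrconus", "vx_mask": "CONUS", "fcst_lev": "Z2",
     "fcst_lead": 30000, "stat_value": 2.1},
    {"model": "HRRR_sppmp_mem1_hrconus", "vx_mask": "CONUS", "fcst_lev": "Z2",
     "fcst_lead": 30000, "stat_value": 2.3},
    {"model": "HRRR_sppmp_mem1_hrconus", "vx_mask": "CONUS", "fcst_lev": "Z2",
     "fcst_lead": 60000, "stat_value": 2.2},
    {"model": "HRRR_sppmp_mem1_hrconus", "vx_mask": "CONUS", "fcst_lev": "Z2",
     "fcst_lead": 60000, "stat_value": 2.5},
    {"model": "HRRR_sppmp_mem1_hrconus", "vx_mask": "CONUS", "fcst_lev": "Z2",
     "fcst_lead": 60000, "stat_value": 2.9},
]

def mixed_fields(rows, fields):
    """Return the fields that take more than one value across the rows."""
    mixed = {}
    for name in fields:
        values = sorted({row[name] for row in rows})
        if len(values) > 1:
            mixed[name] = values
    return mixed

def summarize_by(rows, key, value="stat_value", func=median):
    """Collapse the rows to one summary value per category of `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {k: func(v) for k, v in sorted(groups.items())}

# An empty result means no domains, levels, or models were mixed by accident.
problems = mixed_fields(rows, ["model", "vx_mask", "fcst_lev"])
medians = summarize_by(rows, "fcst_lead")
print(problems, medians)
```

If `problems` is not empty, the corresponding category most likely needs an entry in the “Fixed Values” section before the plot is regenerated.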

11.3. Example

The example below shows a box plot of the RMSE for 2-m temperature over the CONUS. Many of the standard METviewer plotting options are available for the box plot.

../_images/boxplots_plot.png

Figure 11.1 Example box plot created by METviewer for RMSE of 2-m temperature over the CONUS by lead time.

Here is the associated XML for this example. It can be copied into an empty file, saved to the desktop, and then uploaded into the system by clicking on the “Load XML” button in the upper-right corner of the GUI. This XML can also be downloaded from this link: boxplots_xml.xml.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<plot_spec>
    <connection>
        <host>mohawk</host>
        <database>mv_hrrr_sppmp_test</database>
        <user>******</user>
        <password>******</password>
        <management_system>mariadb</management_system>
    </connection>
    <rscript>/usr/local/R/bin/Rscript</rscript>
    <folders>
        <r_tmpl>/opt/vxwww/tomcat/webapps/metviewer//R_tmpl</r_tmpl>
        <r_work>/opt/vxwww/tomcat/webapps/metviewer//R_work</r_work>
        <plots>/d2/www/dtcenter/met/metviewer_output//plots</plots>
        <data>/d2/www/dtcenter/met/metviewer_output//data</data>
        <scripts>/d2/www/dtcenter/met/metviewer_output//scripts</scripts>
    </folders>
    <plot>
        <template>box_plot.R_tmpl</template>
        <dep>
            <dep1>
                <fcst_var name="TMP">
                    <stat>RMSE</stat>
                </fcst_var>
            </dep1>
            <dep2/>
        </dep>
        <series1>
            <field name="model">
                <val>HRRR_sppmp_mem1_hrconus</val>
            </field>
        </series1>
        <series2/>
        <plot_fix>
            <field equalize="false" name="vx_mask">
                <set name="vx_mask_0">
                    <val>CONUS</val>
                </set>
            </field>
            <field equalize="false" name="fcst_lev">
                <set name="fcst_lev_1">
                    <val>Z2</val>
                </set>
            </field>
        </plot_fix>
        <plot_cond/>
        <indep equalize="false" name="fcst_lead">
            <val label="3" plot_val="">30000</val>
            <val label="6" plot_val="">60000</val>
            <val label="9" plot_val="">90000</val>
            <val label="12" plot_val="">120000</val>
            <val label="15" plot_val="">150000</val>
            <val label="18" plot_val="">180000</val>
            <val label="21" plot_val="">210000</val>
            <val label="24" plot_val="">240000</val>
            <val label="27" plot_val="">270000</val>
            <val label="30" plot_val="">300000</val>
            <val label="33" plot_val="">330000</val>
            <val label="36" plot_val="">360000</val>
        </indep>
        <plot_stat>median</plot_stat>
        <tmpl>
            <data_file>plot_20200918_211637.data</data_file>
            <plot_file>plot_20200918_211637.png</plot_file>
            <r_file>plot_20200918_211637.R</r_file>
            <title>2-m Temperature CONUS</title>
            <x_label>Lead Time</x_label>
            <y1_label>RMSE</y1_label>
            <y2_label/>
            <caption/>
            <job_title/>
            <keep_revisions>false</keep_revisions>
            <listdiffseries1>list()</listdiffseries1>
            <listdiffseries2>list()</listdiffseries2>
        </tmpl>
        <execution_type>Rscript</execution_type>
        <event_equal>false</event_equal>
        <vert_plot>false</vert_plot>
        <x_reverse>false</x_reverse>
        <num_stats>false</num_stats>
        <indy1_stag>false</indy1_stag>
        <indy2_stag>false</indy2_stag>
        <grid_on>true</grid_on>
        <sync_axes>false</sync_axes>
        <dump_points1>false</dump_points1>
        <dump_points2>false</dump_points2>
        <log_y1>false</log_y1>
        <log_y2>false</log_y2>
        <varianceinflationfactor>false</varianceinflationfactor>
        <plot_type>png16m</plot_type>
        <plot_height>8.5</plot_height>
        <plot_width>11</plot_width>
        <plot_res>72</plot_res>
        <plot_units>in</plot_units>
        <mar>c(8,4,5,4)</mar>
        <mgp>c(1,1,0)</mgp>
        <cex>1</cex>
        <title_weight>2</title_weight>
        <title_size>1.4</title_size>
        <title_offset>-2</title_offset>
        <title_align>0.5</title_align>
        <xtlab_orient>1</xtlab_orient>
        <xtlab_perp>-0.75</xtlab_perp>
        <xtlab_horiz>0.5</xtlab_horiz>
        <xtlab_freq>0</xtlab_freq>
        <xtlab_size>1</xtlab_size>
        <xlab_weight>1</xlab_weight>
        <xlab_size>1</xlab_size>
        <xlab_offset>2</xlab_offset>
        <xlab_align>0.5</xlab_align>
        <ytlab_orient>1</ytlab_orient>
        <ytlab_perp>0.5</ytlab_perp>
        <ytlab_horiz>0.5</ytlab_horiz>
        <ytlab_size>1</ytlab_size>
        <ylab_weight>1</ylab_weight>
        <ylab_size>1</ylab_size>
        <ylab_offset>-2</ylab_offset>
        <ylab_align>0.5</ylab_align>
        <grid_lty>3</grid_lty>
        <grid_col>#cccccc</grid_col>
        <grid_lwd>1</grid_lwd>
        <grid_x>listX</grid_x>
        <x2tlab_orient>1</x2tlab_orient>
        <x2tlab_perp>1</x2tlab_perp>
        <x2tlab_horiz>0.5</x2tlab_horiz>
        <x2tlab_size>0.8</x2tlab_size>
        <x2lab_size>0.8</x2lab_size>
        <x2lab_offset>-0.5</x2lab_offset>
        <x2lab_align>0.5</x2lab_align>
        <y2tlab_orient>1</y2tlab_orient>
        <y2tlab_perp>0.5</y2tlab_perp>
        <y2tlab_horiz>0.5</y2tlab_horiz>
        <y2tlab_size>1</y2tlab_size>
        <y2lab_size>1</y2lab_size>
        <y2lab_offset>1</y2lab_offset>
        <y2lab_align>0.5</y2lab_align>
        <legend_box>o</legend_box>
        <legend_inset>c(0, -.25)</legend_inset>
        <legend_ncol>1</legend_ncol>
        <legend_size>0.8</legend_size>
        <caption_weight>1</caption_weight>
        <caption_col>#333333</caption_col>
        <caption_size>0.8</caption_size>
        <caption_offset>3</caption_offset>
        <caption_align>0</caption_align>
        <ci_alpha>0.05</ci_alpha>
        <plot_ci>c("none")</plot_ci>
        <show_signif>c(FALSE)</show_signif>
        <plot_disp>c(TRUE)</plot_disp>
        <colors>c("#ff0000FF")</colors>
        <pch>c(20)</pch>
        <type>c("b")</type>
        <lty>c(1)</lty>
        <lwd>c(1)</lwd>
        <con_series>c(1)</con_series>
        <order_series>c(1)</order_series>
        <plot_cmd/>
        <legend>c("")</legend>
        <y1_lim>c()</y1_lim>
        <x1_lim>c()</x1_lim>
        <y1_bufr>0.04</y1_bufr>
        <y2_lim>c()</y2_lim>
        <box_pts>false</box_pts>
        <box_outline>true</box_outline>
        <box_notch>false</box_notch>
        <box_avg>false</box_avg>
        <box_boxwex>0.2</box_boxwex>
    </plot>
</plot_spec>
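Before loading a plot specification like the one above, it can help to list what it fixes and what it plots. The Python sketch below does this with the standard library's XML parser on an abridged copy of the example. Two points are assumptions rather than documented behavior: the `fcst_lead` values appear to be in HHMMSS form (which matches their `label` attributes), and the `box_*` elements presumably correspond to the box plot options in “Plot Configurations”.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the example specification; only a few elements are kept.
SPEC = """<plot_spec>
  <plot>
    <template>box_plot.R_tmpl</template>
    <plot_fix>
      <field equalize="false" name="vx_mask">
        <set name="vx_mask_0"><val>CONUS</val></set>
      </field>
      <field equalize="false" name="fcst_lev">
        <set name="fcst_lev_1"><val>Z2</val></set>
      </field>
    </plot_fix>
    <indep equalize="false" name="fcst_lead">
      <val label="3" plot_val="">30000</val>
      <val label="6" plot_val="">60000</val>
      <val label="36" plot_val="">360000</val>
    </indep>
    <plot_stat>median</plot_stat>
    <box_outline>true</box_outline>
    <box_notch>false</box_notch>
    <box_boxwex>0.2</box_boxwex>
  </plot>
</plot_spec>"""

plot = ET.fromstring(SPEC).find("plot")

# Fixed values: one list of values per fixed field.
fixed = {
    field.get("name"): [val.text for val in field.iter("val")]
    for field in plot.find("plot_fix").findall("field")
}

# Independent variable: assumed HHMMSS, so integer division by 10000 gives hours.
lead_vals = plot.find("indep").findall("val")
lead_hours = [int(val.text) // 10000 for val in lead_vals]
labels = [val.get("label") for val in lead_vals]

# Box-specific settings are the direct children whose tag starts with "box_".
box_opts = {elem.tag: elem.text for elem in plot if elem.tag.startswith("box_")}

print(fixed, lead_hours, labels, box_opts)
```

To inspect a saved file instead of the embedded string, `ET.parse(path).getroot()` can replace `ET.fromstring(SPEC)`.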