Title: | 'ggplot2' Version of "I'm Feeling Lucky!" |
---|---|
Description: | Examines the characteristics of a data frame and a formula to automatically choose the most suitable type of plot out of the following supported options: scatter, violin, box, bar, density, hexagon bin, spine plot, and heat map. The aim of the package is to let the user focus on what to plot, rather than on the "how" during exploratory data analysis. It also automates handling of observation weights, logarithmic axis scaling, reordering of factor levels, and overlaying smoothing curves and median lines. Plots are drawn using 'ggplot2'. |
Authors: | Stefan Schroedl [aut, cre], Christophe Regouby [ctb] |
Maintainer: | Stefan Schroedl <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.1.1.9000 |
Built: | 2024-11-16 04:20:45 UTC |
Source: | https://github.com/stefan-schroedl/plotluck |
The purpose of plotluck
is to let the user focus on what to plot,
and automate the how. Given a dependency formula with up to three
variables, it tries to choose the most suitable type of plot. It also automates
sampling large datasets, correct handling of observation weights, logarithmic
axis scaling, ordering and pruning of factor levels, and overlaying smoothing
curves or median lines.
plotluck(data, formula, weights, opts = plotluck.options(), ...)
plotluck(data, formula, weights, opts = plotluck.options(), ...)
data |
a data frame. |
||||||||||||||||||||||||||||
formula |
an object of class
In addition to these base plot types, the dot symbol
See also section "Generating multiple plots at once" below. |
||||||||||||||||||||||||||||
weights |
observation weights or frequencies (optional). |
||||||||||||||||||||||||||||
opts |
a named list of options (optional); See also |
||||||||||||||||||||||||||||
... |
additional parameters to be passed to the respective ggplot2 geom objects. |
a ggplot object, or a plotluck.multi object if the dot symbol was used.
Besides the shape of the formula, the algorithm takes into account the type of variables as either numeric, ordered, or unordered factors. Often, it makes sense to treat ordered factors similarly as numeric types.
One-variable numeric (resp. factor) distributions are usually represented by
density (resp. Cleveland dot) charts, but can be overridden to histograms or
bar plots using the geom
option. Density plots come with an overlaid
vertical median line.
For two numerical variables, by default a scatter plot is produced, but for
high numbers of points a hexbin is preferred (option min.points.hex
).
These plots come with a smoothing line and standard deviation.
The relation between two factor variables can be depicted best by spine
(a.k.a., mosaic) plots, unless they have too many levels (options
max.factor.levels.spine.x
, max.factor.levels.spine.y
,
max.factor.levels.spine.z
). Otherwise, a heat map is produced.
For a mixed-type (factor/numeric) pair of variables, violin (overridable
to box) plots are generated. However, if the resulting graph would contain
too many (more than max.factor.levels.violin
) violin plots in a row,
the algorithm switches automatically. The number of bins of a histogram can
be customized with n.breaks.histogram
. The default setting, NA
,
applies a heuristic estimate.
The case of a response two dependent variables ('y~x+z') is covered by either a spine plot (if all are factors) or a heat map.
In many cases with few points for one of the aggregate plots, a scatter
looks better (options min.points.density
, min.points.violin
,
min.points.hex
).
If each factor combination occurs only once in the data set, we resort to bar plots.
Conditional variables are represented by either
trying to fit into the same graph using coloring (max.factor.levels.color
),
or by facetting (preferred dimensions facet.num.wrap
(resp.
facet.num.grid
) for one resp. two variables). Numeric vectors are
discretized accordingly. Facets are laid out horizontally or vertically
according to the plot type, up to maximum dimensions of facet.max.rows
and facet.max.cols
.
To better illustrate the relation between an independent factor variable and a dependent numerical variable (or an ordered factor), levels are reordered according to the value of the dependent variable. If no other numeric or ordered variable exists, we sort by frequency.
Argument weights
allows to specify weights
or frequency counts for each row of data. All plots and summary statistics
take weights into account when supplied. In scatter and heat maps, weights
are indicated either by a shaded disk with proportional area (default) or by
jittering (option dedupe.scatter
), if the number of duplicated points
exceeds min.points.jitter
. The amount of jittering can be controlled
with jitter.x
and jitter.y
.
plotluck
supports logarithmic and log-modulus
axis scaling. log-modulus is considered if values are both positive and
negative; in this case, the transform function is f(x) = sign(x) *
log(1+abs(x))
.
The heuristic to apply scaling is based on the proportion of total display
range that is occupied by the 'core' region of the distribution between the
lower and upper quartiles; namely, the fact whether the transform could
magnify this region by a factor of at least trans.log.thresh
.
By default, missing (NA
or NaN
) values
in factors are are shown as a special factor level code"?". They can be
removed by setting na.rm=TRUE
. Conventionally, missing numeric values
are not shown.
For very large data sets, plots can take a very long time
(or even crash R). plotluck
has a built-in stop-gap: If the data
comprises more than sample.max.rows
, it will be sampled down to that
size (taking into account weights
, if supplied).
Character (resp. logical) vectors are converted to unordered (resp. ordered) factors.
Frequently, when numeric variables have very few values despite sufficient
data size, it helps to treat these values as the levels of a factor; this is
governed by option few.unique.as.factor
.
If an unordered factor has too many levels, plots can get messy. In this
case, only the max.factor.levels
most frequent ones are retained,
while the rest are merged into a default level ".other."
.
If color
or fill
aesthetics are used to
distinguish different levels or ranges of a variable, the color scheme adjusts
to the type. Preferably, a sequential (resp. qualitative) palette is chosen
for a numeric/ordered (unordered) factor (palette.brewer.seq
,
palette.brewer.qual
); see also RColorBrewer.
If formula
contains a dot
("."
) symbol, the function creates a number of 1D or 2D plots by calling
plotluck
repeatedly. As described above, this allows either single
distribution, one-vs-all and all-vs-all variable plots. To save space,
rendering is minimal without axis labels.
In the all-vs-all case, the diagonal contains 1D distribution plots, analogous
to the behavior of the default plot method for data frames, see
plot.data.frame
.
With setting in.grid=FALSE
, plots are produced in a sequence, otherwise
together on one or multiple pages, if necessary (default). Page size is
controlled by multi.max.rows
and multi.max.cols
.
With entropy.order=TRUE
, plots are sorted by an estimate of
empirical conditional entropy, with the goal of prioritizing the more
predictive variables. Set verbose=TRUE
if you want to see the actual
values. For large data sets the calculation can be time consuming; entropy
calculation can be suppressed by setting multi.entropy.order=FALSE
.
@note The return value is an object of class plotluck_multi
. This
class does not have any functionality; its sole purpose is to make this
function work in the same way as ggplot
and plotluck
, namely,
do the actual drawing if and only if the return value is not assigned.
With the option verbose=TRUE
turned on, the function
will print out information about the chosen and applicable plot types, ordering,
log scaling, etc.
Variable names can be abbreviated if they match a column name uniquely by prefix.
By default, plotluck
uses violin and density plots in place of the more traditional box-and-whisker
plots and histograms; these modern graph types convey the shape of a
distribution better. In the former case, summary statistics like mean and
quantiles are less useful if the distribution is not unimodal; a wrong
choice of the number of bins of a histogram can create misleading artifacts.
Following Cleveland's advice, factors are plotted on the y-axis to make labels
most readable and compact at the same time. This direction can be controlled
using option prefer.factors.vert
.
Due to their well-documented problematic aspects, pie charts and stacked bar graphs are not supported.
With real-world data (as opposed to smooth mathematical functions), three-dimensional scatter, surface, or contour plots can often be hard to read if the shape of the distribution is not suitable, data coverage is uneven, or if the perspective is not carefully chosen depending on the data. Since they usually require manual tweaking, we have refrained from incorporating them.
For completeness, we have included the description of option parameters in the current help page. However, the tenet of this function is to be usable "out-of-the-box", with no or very little manual tweaking required. If you find yourself needing to change option values repeatedly or find the presets to be suboptimal, please contact the author.
plotluck
is designed for generic out-of-the-box
plotting, and not suitable to produce more specialized types of plots that
arise in specific application domains (e.g., association, stem-and-leaf,
star plots, geographic maps, etc). It is restricted to at most three variables.
Parallel plots with variables on different scales (such as time
series of multiple related signals) are not supported.
plotluck.options
, sample.plotluck
, ggplot
# Single-variable density data(diamonds, package='ggplot2') plotluck(diamonds, price~1) invisible(readline(prompt="Press [enter] to continue")) # Violin plot data(iris) plotluck(iris, Species~Petal.Length) invisible(readline(prompt="Press [enter] to continue")) # Scatter plot data(mpg, package='ggplot2') plotluck(mpg, cty~model) invisible(readline(prompt="Press [enter] to continue")) # Spine plot data(Titanic) plotluck(as.data.frame(Titanic), Survived~Class+Sex, weights=Freq) invisible(readline(prompt="Press [enter] to continue")) # Facetting data(msleep, package='ggplot2') plotluck(msleep, sleep_total~bodywt|vore) invisible(readline(prompt="Press [enter] to continue")) # Heat map plotluck(diamonds, price~cut+color) # Multi plots # All 1D distributions plotluck(iris, .~1) # 2D dependencies with one fixed variable on vertical axis plotluck(iris, Species~.) # See also tests/testthat/test_plotluck.R for more examples!
# Single-variable density data(diamonds, package='ggplot2') plotluck(diamonds, price~1) invisible(readline(prompt="Press [enter] to continue")) # Violin plot data(iris) plotluck(iris, Species~Petal.Length) invisible(readline(prompt="Press [enter] to continue")) # Scatter plot data(mpg, package='ggplot2') plotluck(mpg, cty~model) invisible(readline(prompt="Press [enter] to continue")) # Spine plot data(Titanic) plotluck(as.data.frame(Titanic), Survived~Class+Sex, weights=Freq) invisible(readline(prompt="Press [enter] to continue")) # Facetting data(msleep, package='ggplot2') plotluck(msleep, sleep_total~bodywt|vore) invisible(readline(prompt="Press [enter] to continue")) # Heat map plotluck(diamonds, price~cut+color) # Multi plots # All 1D distributions plotluck(iris, .~1) # 2D dependencies with one fixed variable on vertical axis plotluck(iris, Species~.) # See also tests/testthat/test_plotluck.R for more examples!
plotluck
Create options structure for plotluck
plotluck.options(opts, ...)
plotluck.options(opts, ...)
opts |
An (optional) named list to start with. Anything not specified in ... will be inherited from opts. |
... |
Parameters to override default settings |
A named list of options, usable as argument to function plotluck
.
plotluck
accepts a list of options to modify its behavior. Calling
plotluck.options
without arguments produces a list with the default
values. Specifying any number of attribute/value pairs overrides them
selectively.
Option | Default | Comment |
na.rm |
FALSE |
Do not show missing factor values as separate level. |
geom |
"auto" |
Override type of plot; the available types for a given formula and variables can be inspected with verbose=TRUE . |
sample.max.rows |
100000 |
If the data set has more rows than that, sample it down. |
trans.log.thresh |
2 |
Threshold for logarithmic axis scaling. Visible magnification factor of the central region of the distribution. |
n.breaks.histogram |
NA |
Override the number of histogram breaks. |
min.points.hex |
5000 |
Minimum data points required to display a hexbin plot. |
min.points.density |
20 |
Minimum data points required to display a density or histogram plot. |
min.points.violin |
20 |
Minimum data points required to display a violin or box plot. |
resolution.heat |
30 |
Grid spacing for heat maps. |
dedupe.scatter |
'area' |
To represent multiple instances of the same coordinates in scatter plot: scale the point size, or use jitter? |
min.points.jitter |
3 |
Minimum number of coordinate duplicates to start jittering points. |
jitter.x |
0.4 |
Amount of jitter to apply in horizontal direction, as a fraction of resolution. |
jitter.y |
0.4 |
Amount of jitter to apply in vertical direction, as a fraction of resolution. |
few.unique.as.factor |
5 |
If a numeric variable has less than that many unique values, make it an ordered factor. |
max.factor.levels |
30 |
For factors with more than that many levels, least frequent ones will be pruned into "other". |
max.factor.levels.color |
3 |
Maximum number of factor levels that can be represented as colors in the same plot. |
max.factor.levels.violin |
20 |
Maximum number of levels to plot violins; rather switch to box plot. |
max.factor.levels.spine.x |
20 |
Maximum number of levels to plot in x-direction in a spine plot. |
max.factor.levels.spine.y |
10 |
Maximum number of levels to plot in y-direction in a spine plot. |
max.factor.levels.spine.z |
5 |
Maximum number of levels to represent as colors in a spine plot. |
spine.plot.margin.x |
0.05 |
Horizontal gap between rectangles in a spine plot. |
spine.plot.margin.y |
0.02 |
Vertical gap between rectangles in a spine plot. |
facet.max.cols |
10 |
Maximum number of facet columns for conditional variables. |
facet.max.rows |
10 |
Maximum number of facet rows for conditional variables. |
facet.num.wrap |
6 |
Preferred number of facets for single conditional variable. |
facet.num.grid |
3 |
Preferred number of facets for each of two conditional variables. |
prefer.factors.vert |
TRUE |
In mixed numeric/factor plots, use vertical axis for the factor. |
fill.default |
"deepskyblue" |
Default fill color for density and histogram plots. |
palette.brewer.seq |
"YlGn" |
Sequential brewer palette name. |
palette.brewer.qual |
"Set1" |
Qualitative brewer palette name. |
multi.entropy.order |
TRUE |
Use estimated conditional entropy to order multi-plots. |
multi.max.rows |
6 |
Maximum number of rows for multi-plots. |
multi.max.cols |
6 |
Maximum number of columns for multi-plots. |
multi.in.grid |
TRUE |
In multi-plots, make a page with multiple plots, or generate each one separately. |
verbose |
FALSE
|
Print information about plot types, ordering, scaling, etc. |
plotluck
's aim is to provide a function that is usable
"out-of-the-box", with no or very little manual tweaking. If you find
yourself needing to change option values repeatedly or find the presets to
be suboptimal, please contact the author.
# list all default options plotluck.options() data(iris) # default with violin plot plotluck(iris, Petal.Length~Species) # use box-and-whiskers plot instead plotluck(iris, Petal.Length~Species, opts=plotluck.options(geom='box'))
# list all default options plotluck.options() data(iris) # default with violin plot plotluck(iris, Petal.Length~Species) # use box-and-whiskers plot instead plotluck(iris, Petal.Length~Species, opts=plotluck.options(geom='box'))
plotluck
for a randomly generated formula.sample.plotluck
samples a formula as follows:
Uniformly draw the number of variables (1-3).
For each variable, uniformly choose one of the existing variable types from the data set (numeric, ordered or unordered factor).
Uniformly select one of the data frame columns of that type.
sample.plotluck(data, ...)
sample.plotluck(data, ...)
data |
a data frame |
... |
additional parameters to be passed to |
a ggplot2 object.
set.seed(42) data(iris) sample.plotluck(iris)
set.seed(42) data(iris) sample.plotluck(iris)