
Detection of outlier points in the time courses of an experiment in the PhenoArch greenhouse. This procedure can be applied to any time-course data set. It uses the locfit smoothing function from the locfit library [2]. For each time course of a data set, a locfit smoothing is applied and a predictive confidence interval is calculated (Y_hat +/- threshold * Y_hat_se).

Points outside this confidence interval are declared outliers. The user chooses the threshold.
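The interval rule described above can be sketched on a single synthetic time course. This is a minimal illustration, not the package's implementation: the data are simulated, the column names (`time`, `value`, `is_outlier`) are made up for the example, and the locfit settings (`h = 30`, `deg = 2`) are assumptions.

```r
# Sketch only: synthetic data; the locfit settings below are assumptions.
library(locfit)

set.seed(42)
course <- data.frame(time = 1:40)
course$value <- 0.05 * course$time^2 + rnorm(nrow(course), sd = 1)
course$value[20] <- course$value[20] + 30   # inject one artificial outlier

threshold <- 10   # multiplier applied to the standard error of the fit

# Local quadratic fit with a constant bandwidth h
fit <- locfit(value ~ lp(time, h = 30, deg = 2), data = course)

# Fitted values and their standard errors at the observed times
pred <- predict(fit, newdata = course$time, se.fit = TRUE)

course$ypred    <- pred$fit
course$sd_ypred <- pred$se.fit
course$lwr      <- course$ypred - threshold * course$sd_ypred
course$upr      <- course$ypred + threshold * course$sd_ypred

# A point is flagged when it lies outside [lwr, upr]
course$is_outlier <- course$value < course$lwr | course$value > course$upr

# Show the flagged rows, if any
course[course$is_outlier, ]
```

A larger `threshold` widens the interval, so fewer points are flagged; a smaller one narrows it.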

FuncDetectPointOutlierLocFit: detection of outlier points in time courses

  • @param datain input dataframe. This dataframe contains a set of time courses
  • @param myparam character, name of the variable to model in datain (for example, Biomass, PH or LA and so on)
  • @param mytime character, name of the time variable in datain which must be numeric
  • @param myid character, name of the id variable in datain
  • @param mylevel numeric, factor used to calculate the confidence interval. Increase mylevel to flag fewer outliers
  • @param mylocfit numeric, the constant component of the smoothing parameter (see locfit()). Increase mylocfit to obtain a smoother curve

@return a data.frame:

  • Ref: the id variable
  • mytime: name of the time variable in datain
  • myparam: name of the modeled variable in datain
  • ypred: the locfit prediction
  • sd_ypred: standard deviation of the prediction
  • lwr: lower bound of the confidence interval
  • upr: upper bound of the confidence interval
  • outlier: flag of detected outlier (0 is outlier, 1 is not)

If a time course has fewer than 6 points, no smoothing is done and a warning is issued.
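The minimum-length rule can be illustrated with a small base R helper. The function `ids_with_enough_points` and the toy data are hypothetical and not part of openSilexStatR; the sketch also assumes that every row counts as a point, whereas the package may treat missing values differently.

```r
# Hypothetical helper, not part of openSilexStatR: returns the ids whose
# time course has at least `min_points` rows and warns about the others.
ids_with_enough_points <- function(datain, myid, min_points = 6) {
  # as.character() drops unused factor levels before counting
  counts <- table(as.character(datain[[myid]]))
  too_short <- names(counts)[counts < min_points]
  if (length(too_short) > 0) {
    warning("No smoothing for time courses with fewer than ", min_points,
            " points: ", paste(too_short, collapse = ", "))
  }
  names(counts)[counts >= min_points]
}

# Toy data: plant_A has 8 points, plant_B only 3
toy <- data.frame(
  Ref = c(rep("plant_A", 8), rep("plant_B", 3)),
  thermalTime = c(1:8, 1:3)
)

# The warning is suppressed here only to keep the example output clean
kept <- suppressWarnings(ids_with_enough_points(toy, "Ref"))
kept
```

Only the ids returned by such a check would then be passed on to the smoothing step.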


Import of data

In this vignette, we use a toy data set from the openSilexStatR library (an anonymized real data set).

## 'data.frame':    47022 obs. of  14 variables:
##  $ Ref            : Factor w/ 1680 levels "manip1_10_10_WW",..: 131 131 131 131 131 131 131 131 131 131 ...
##  $ experimentAlias: Factor w/ 1 level "manip1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Day            : Factor w/ 42 levels "2013-02-01","2013-02-02",..: 3 4 5 6 7 9 9 10 11 12 ...
##  $ potAlias       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ scenario       : Factor w/ 2 levels "WD","WW": 2 2 2 2 2 2 2 2 2 2 ...
##  $ genotypeAlias  : Factor w/ 274 levels "11430_H","A310_H",..: 165 165 165 165 165 165 165 165 165 165 ...
##  $ repetition     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Line           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Position       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ thermalTime    : num  1.29 2.65 3.98 5.32 6.66 ...
##  $ plantHeight    : num  140 151 213 239 271 ...
##  $ leafArea       : num  0.018 0.019 0.0208 0.0222 0.0235 ...
##  $ biovolume      : num  0.253 0.62 1.201 1.68 3.396 ...
##  $ Repsce         : Factor w/ 15 levels "1_WD","1_WW",..: 2 2 2 2 2 2 2 2 2 2 ...

Outlier points detection

I have chosen a smoothing parameter of 30 and a threshold of 10 to detect the outlier points.

  # Selection of only 2 genotypes to speed up the process
  # (filter() is assumed to be dplyr::filter)
  plantSel <- c("11430_H", "A310_H")
  mydataSub <- dplyr::filter(mydata, genotypeAlias %in% plantSel)



## Warning: Removed 15 rows containing missing values (geom_point).

