---
knit: (function(inputFile, encoding) { rmarkdown::render(inputFile, encoding=encoding, output_file='NEON_surfaceMicrobe_userGuide_vA.pdf')})
fontsize: 11pt
geometry: margin=1in, includefoot, top=1in, bottom=1.25in
header-includes:
  - \usepackage{fancyhdr}
  #- \usepackage[usenames,dvipsnames,svgnames,table]{xcolor}
  - \usepackage{lastpage}
  - \usepackage{hyperref}
  - \usepackage[hypcap]{caption}
  - \usepackage{array}
  - \usepackage{graphicx}
  - \usepackage[datesep=/]{datetime2}
  - \usepackage{ltablex}
  - \usepackage{pifont}
  - \usepackage{ragged2e}
  - \usepackage{enumitem}
  - \usepackage{float}
  - \usepackage{morefloats}
  - \usepackage{rotating}
  - \usepackage{textcomp}
  - \usepackage{amsmath}
linkcolor: black
mainfont: Calibri
output:
  pdf_document:
    fig_caption: yes
    fig_width: 4
    latex_engine: xelatex
    number_sections: yes
  html_document: default
  word_document: default
---

[//]:TO DO:

```{r set FilePath, echo=FALSE, message=FALSE}
#Set working directory and file paths
# it's nice to just make this so it points to your own directories
rm(list=ls())
if (file.exists('C:/Users/ritterm1')){
  wdir<-'C:/Users/ritterm1/Documents/GitHub/definitional-data/DPUGs'
}
if (file.exists('C:/Users/sparker')){
  wdir<-'C:/Users/sparker/Desktop/GitHub/definitional-data/DPUGs'
}
if (file.exists('C:/Users/cscott')){
  wdir<-'C:/Users/cscott/Documents/Github/definitional-data/DPUGS'
}
if (file.exists('C:/Users/rkrauss')){
  wdir<-'C:/Users/rkrauss/Documents/Github/FORKS/definitional-data/DPUGS'
}
logoFile <- paste0(wdir, "/Logo")
myPathToGraphics <- paste(wdir,'amc_userGuide','graphics', sep='/')
```

```{r myVariables, echo=FALSE}
## set variables
##Note- for future versions, modifications to latex tables below may be necessary
#Title page
myTitle <- "NEON User Guide for Surface Water Microbe Collection (DP1.20282.001, DP1.20281.001, DP1.20141.002)"
firstAuthor <- "Stephanie Parker"
rev <- "A"
authorOrg <- "AQU"
secondAuthor <- "Madie Ritter"
secondOrg <- "AQU"
#change record
revA <- "A"
revADate <-
"03/13/2026" revADesc <- "Initial Release after removing discontinued cell count data product" # Applicable documents ingestName <- "NEON Raw Data Validation for Surface Water Microbe Cell Count (DP0.20138.001)" ingestNum <- "DP0.20138.001 \\_dataValidation.csv" pubName1 <- "NEON Data Variables for Surface Water Microbe Marker Gene Sequences (DP1.20282.001)" pubNum1 <- "DP1.20282.001 \\_variables.csv" pubName2 <- "NEON Data Variables for Surface Water Microbe Metagenome Sequences (DP1.20281.001)" pubNum2 <- "DP1.20281.001 \\_variables.csv" pubName3 <- "NEON Data Variables for Surface Water Microbe Community Taxonomy (DP1.20141.002)" pubNum3 <- "DP1.20141.002 \\_variables.csv" designName <- "NEON Aquatic Sampling Strategy" designNum <- "NEON.DOC.001152" protocolName <- "AOS Protocol and Procedure: Aquatic Microbial Sampling" protocolNum <- "NEON.DOC.003041" atbdName <- "NEON Algorithm Theoretical Basis Document: OS Data Quality Control" atbdNum <- "NEON.DOC.005424" ``` ```{r libraries, echo=FALSE, message=FALSE, warning=FALSE} #load libraries # call all necessary libraries #if packages listed below are not installed, use install.packages("packageName") before running library(packageName)# library(knitr) library(kfigr) # install TeX program : http://miktex.org/2.9/setup (Windows) ; http://tug.org/mactex/ (Mac) # re-install, update miktex - datetime2 package (called here) was released 3/24/15, all older versions do not have this package pre-loaded. # documentation for kfigr https://github.com/mkoohafkan/kfigr/blob/master/vignettes/introduction.Rmd ``` [//]: TEMPLATE SECTION 1 - DO NOT CHANGE!! 
##################################### #####################################
[//]: These are document settings for NEON template, code will not print

\title{}
\DTMsetdatestyle{mmddyyyy}
\thispagestyle{fancy}
\sloppy
\RaggedRight
\pagestyle{fancy}
\pagenumbering{gobble}
\fancyhf{}
\fancyheadoffset[L, R]{0.5in}
\fancyfootoffset[L, R]{0.5in}
\fancyhead[L]{\includegraphics[width=4cm] {`r logoFile`} \vspace{0.5cm}}
\setlength{\headheight}{68.34325pt}
\fancyhead[R]{
\footnotesize
\setlength{\extrarowheight}{1mm}
\begin{tabular}{|p{11cm}|p{3cm}|}
\hline
{\emph{Title:} `r myTitle`} & \emph{Date:} \today \\ \hline
\emph{Author:} `r firstAuthor` & \emph{Revision:} `r rev` \\ \hline
\end{tabular}
\vspace{1cm}
}

[//]: check out http://www.tablesgenerator.com/latex_tables for help with code to generate tables

\renewcommand{\headrulewidth}{0pt}
\maketitle
\thispagestyle{fancy}
\begin{center}
\vspace{1in}
\huge \bfseries \uppercase {`r myTitle` \\[0.4cm] }
\end{center}
\vspace{1in}
\setlength{\extrarowheight}{1mm}
\begin{tabular}{|m{7.5cm}|m{7.5cm}|}
\hline
\textbf{PREPARED BY} &\textbf{ORGANIZATION} \\ \hline
`r firstAuthor` & `r authorOrg` \\ \hline
`r secondAuthor` & `r secondOrg` \\ \hline
\end{tabular}
\bigskip
\newpage
\vspace{3in}
\begin{center}
\bfseries\Large\uppercase{Change Record}
\end{center}
\vspace{1in}
\setlength{\extrarowheight}{1mm}
\begin{tabular}{|m{2.5cm}|m{2.5cm}|m{11.5cm}|}
\hline
\textbf{REVISION} & \textbf{DATE} & \textbf{DESCRIPTION OF CHANGE} \\ \hline
`r revA` & `r revADate` & `r revADesc` \\ \hline
\end{tabular}
\newpage
\fancyfoot[L]{ \footnotesize{} }
\pagenumbering{roman}
\fancyfoot[C]{ \thepage \hspace{1pt} }
\renewcommand\contentsname{\uppercase{Table of contents}}
\tableofcontents
\renewcommand\listtablename{\uppercase{list of tables and figures}}
\let\oldnumberline\numberline
\renewcommand{\numberline}{\tablename~\oldnumberline}
\listoftables
\renewcommand{\numberline}{\figurename~\oldnumberline}
\renewcommand\listfigurename{}
\addtocontents{lof}{\vskip -1.2cm}
\listoffigures
\newpage
\fancyfoot[L]{ \footnotesize{} }
\section{DESCRIPTION}
\pagenumbering{arabic}
\fancyfoot[C]{ Page \thepage \hspace{1pt} of \pageref{LastPage} }

[//]: TEMPLATE SECTION 2 #######################################################
[//]: Okay to edit

## Purpose

This document provides an overview of the data included in this NEON Level 1 (L1) data product, the quality-controlled product generated from raw Level 0 (L0) data, and associated metadata. In the NEON data products framework, the raw data collected in the field (for example, the dry weights of litter functional groups from a single collection event) are considered the lowest level (Level 0). Raw data that have been quality checked via the steps detailed herein, as well as simple metrics that emerge from the raw data, are considered Level 1 data products.

The text herein provides a discussion of measurement theory and implementation, data product provenance, quality assurance and control methods used, and approximations and/or assumptions made during L1 data creation.

## Scope

This document describes the steps needed to generate the field records for the L1 data products related to surface water microbe collection. Sequencing data from the external laboratory are covered in the data product user guides (DPUGs) for Marker genes, Metagenomics, and Microbe Community Taxonomy. This document also provides details relevant to the publication of the data products via the NEON data portal, with additional detail available in the file.

This document also describes the field data used in the surface water microbial sequencing data products `r pubName1` (AD[05]), `r pubName2` (AD[06]), and `r pubName3` (AD[07]). Details on the publication of these data products can be found in their respective user guides.

This document describes the process for ingesting and performing automated quality assurance and control procedures on the data collected in the field pertaining to `r protocolName` (AD[08]).
The raw data that are processed in this document are detailed in the file `r ingestName` (AD[04]), provided in the download package for this data product. Please note that raw, L0 data products (denoted by 'DP0') may not always have the same numbers (e.g., '20138') as the corresponding L1 data product.

\pagebreak

# RELATED DOCUMENTS AND ACRONYMS

## Associated Documents

\begin{tabular}{|m{1.5cm}|m{3.5cm}| m{10.5cm}|}
\hline
AD[01] & NEON.DOC.000001 & NEON Observatory Design (NOD) Requirements \\ \hline
AD[02] & `r designNum` & `r designName` \\ \hline
AD[03] & NEON.DOC.002652 & NEON Data Products Catalog \\ \hline
AD[04] & Available with data download & `r ingestName` \\ \hline
AD[05] & Available with data download & `r pubName1` \\ \hline
AD[06] & Available with data download & `r pubName2` \\ \hline
AD[07] & Available with data download & `r pubName3` \\ \hline
AD[08] & `r protocolNum` & `r protocolName` \\ \hline
AD[09] & NEON.DOC.000008 & NEON Acronym List \\ \hline
AD[10] & NEON.DOC.000243 & NEON Glossary of Terms \\ \hline
AD[11] & NEON.DOC.002905 & AOS Protocol and Procedure: Water Chemistry Sampling in Surface Waters and Groundwater\\ \hline
AD[12] & NEON.DOC.004825 & NEON Algorithm Theoretical Basis Document: OS Generic Transitions \\ \hline
AD[13] & Available on NEON data portal & NEON Ingest Conversion Language Function Library \\ \hline
AD[14] & Available on NEON data portal & NEON Ingest Conversion Language \\ \hline
AD[15] & Available with data download & Categorical Codes csv \\ \hline
AD[16] & `r atbdNum` & `r atbdName` \\ \hline
\end{tabular}

## Acronyms

\begin{tabular}{|m{3.5cm}|m{12.5cm}|}
\hline
\bfseries{Acronym}& \bfseries{Definition} \\ \hline
PI & Propidium iodide \\ \hline
\end{tabular}

\newpage

# DATA PRODUCT DESCRIPTION

This data product contains the quality-controlled field sampling metadata and field data for surface water microbe analysis.
Field samples are collected in the water column of wadeable streams, rivers, and lakes in conjunction with surface water chemistry samples. Bulk water samples are collected and subsampled for genetic analyses, metagenomics, and archive. Surface water microbes are collected 12 times per year in wadeable streams and 6 times per year in lakes and non-wadeable streams, at the same time and location as standard recurrent (monthly) water chemistry samples (AD[11]). Details on sampling locations and timing are provided in `r ingestName` (AD[04]) and the AOS Protocol and Procedure: Water Chemistry Sampling in Surface Waters and Groundwater (AD[11]).

Samples are collected as grab samples from the water column at sampling locations near the S2 sensor set in streams, near the buoy in rivers, near the buoy, inlet, and outlet sensor sets in flow-through lakes, and near the buoy sensor sets in seepage lakes. In lakes and rivers with a stratified water column, samples are collected at multiple depths.

Water samples are processed in the field, or in the lab within 4 hours of collection if field conditions are not conducive to subsampling (e.g., freezing conditions). Samples for genetic analysis, metagenomics, and archive are filtered on a 0.22 um Sterivex filter and flash-frozen. Genetic filter samples are stored at -80\textsuperscript{o}C until analysis at an external facility. Details of lab analyses are included in the user guides for `r pubName1` (AD[05]), `r pubName2` (AD[06]), and `r pubName3` (AD[07]).

DNA extracts and archived Sterivex filter samples from the field are stored at -80\textsuperscript{o}C at the NEON Biorepository and are available by request for further study and analysis. Contact the [Biorepository](https://www.neonscience.org/samples) for detailed information about sample availability.

## Spatial Sampling Design

Aquatic surface water microbial samples are collected at all NEON aquatic sites at the same time and location as surface water chemistry samples (AD[11]).
At stream sites, 1 sample is collected from the thalweg <1 m downstream of the downstream sensor set (S2) on each sampling date. Samples represent the water column, so care is taken to avoid stirring up sediments that may contaminate the sample.

At river (non-wadeable stream) sites, surface water microbe samples are collected just downstream of the sensor set or profiling buoy (station = 'c0') (\autoref{fig:stream}). If the river is non-stratified, samples are collected at 0.5 m depth. If the river is stratified, an epilimnion sample is collected at 0.5 m (station = 'c1') and an integrated sample is collected from the hypolimnion (station = 'c2'). Care is taken to avoid contamination from sediments suspended by the boat motor or anchor.

At flow-through lake sites, samples are collected near the profiling buoy, the inlet sensor, and the outlet sensor (\autoref{fig:stream}). In seepage lakes with no defined inlet and outlet, samples are only collected near the profiling buoy. Near the buoy, sampling depth depends on the presence or absence of lake stratification (\autoref{fig:lake}). In an unstratified lake, the sample is collected near the surface at 0.5 m depth. In a stratified lake, samples are collected from the hypolimnion in addition to the surface water sample. In lakes with a shallow hypolimnion (<4 m), the sample is collected from the midpoint of the hypolimnion. In lakes with a deeper hypolimnion (>4 m), an integrated sample is collected throughout the hypolimnion. Samples collected near the inlet and outlet sensor sets are collected near the surface at 0.5 m depth. See `r protocolName` (AD[08]) and AOS Protocol and Procedure: Water Chemistry Sampling in Surface Waters and Groundwater (AD[11]) for additional details.

As much as possible, sampling occurs in the same locations over the lifetime of the Observatory. However, over time some sampling locations may become impossible to sample, due to disturbance or other local changes.
When this occurs, the location and its location ID are retired. A location may also shift to slightly different coordinates. Refer to the locations endpoint of the NEON API for details about locations that have been moved or retired: https://data.neonscience.org/data-api/endpoints/locations/

\begin{figure}[H]
\centering
\includegraphics[width=16cm]{`r paste(myPathToGraphics,'Combined_Microbes_2026.jpg', sep='/')`}
\caption{Generic aquatic site layouts with surface water microbe sampling locations highlighted in red boxes for wadeable streams, rivers, and both seepage and flow-through lakes.}
\label{fig:stream}
\end{figure}

\begin{figure}[H]
\centering
\includegraphics[width=16cm]{`r paste(myPathToGraphics,'lake_sample_depth.jpg', sep='/')`}
\caption{Lake and river sampling depths in non-stratified and stratified water columns.}
\label{fig:lake}
\end{figure}

## Temporal Sampling Design

Surface water microbe samples are collected at the same time and location as standard recurrent water chemistry sampling: once per month in wadeable streams (12 times per year). At stream sites, samples are collected year-round, including when the stream is frozen over if the ice can be broken by hand. When the ice becomes too thick, sampling is suspended and noted as **samplingImpractical** in the field data. At lake and river sites, microbe samples are collected every other month with standard recurrent water chemistry samples. At northern sites, samples are collected year-round and are collected under the ice during winter. See `r designName` (AD[02]), `r protocolName` (AD[08]), and AOS Protocol and Procedure: Water Chemistry Sampling in Surface Waters and Groundwater (AD[11]) for additional details.

## Sampling Design Changes

2014-2017: During the first four sampling years, samples were collected at three locations in all lakes (seepage or flow-through). Starting in 2018, samples from seepage lakes are only collected near the buoy sensor sets.

2024: Field blank samples are collected starting in 2024, once per year in mid-summer, and are sequenced by the external lab.

## Variables Reported

All variables reported from the field or laboratory technician (L0 data) are listed in the file, `r ingestName` (AD[04]). All variables reported in the published data (L1 data) are also provided separately in the file, `r pubName1` (AD[05]).

Field names have been standardized with Darwin Core terms (http://rs.tdwg.org/dwc/; accessed 16 February 2014), the Global Biodiversity Information Facility vocabularies (http://rs.gbif.org/vocabulary/gbif/; accessed 16 February 2014), and the VegCore data dictionary (https://projects.nceas.ucsb.edu/nceas/projects/bien/wiki/VegCore; accessed 16 February 2014), where applicable. NEON OS spatial data employ the World Geodetic System 1984 (WGS84) as the fundamental reference datum and the Geoid12A geoid model as the vertical reference surface. Latitudes and longitudes are denoted in decimal notation to six decimal places, with longitudes indicated as negative west of the Greenwich meridian.

Some variables described in this document may be for NEON internal use only and will not appear in downloaded data.

## Spatial Resolution and Extent

The finest resolution at which spatial data are reported is a single station within a site. For example, data may be collected at a specific depth in the water column of a lake. The basic spatial data included in the data download are the latitude, longitude, and elevation of the named location at the aquatic site (e.g., the aquatic location), or the latitude and longitude of an alternate location if the named location is not suitable for sampling.

\begin{center}
\textbf{namedLocation} (unique ID given to the location within the site) \ding{221} \textbf{siteID} (ID of NEON site) \ding{221} \textbf{domainID} (ID of a NEON domain)
\end{center}

## Temporal Resolution and Extent

The finest resolution at which temporal data are reported is **collectDate**, the date and time of day when the sample was collected in the field.

The NEON Data Portal provides data in monthly files for query and download efficiency. Queries including any part of a month will return data from the entire month. Code to stack files across months is available here: https://github.com/NEONScience/NEON-utilities.

## Associated Data Streams

Surface water microbe samples are related to water chemistry samples collected at the same time and location. Water chemistry data are available in the 'Chemical properties of surface water' data product (DP1.20093.001).

Data generated from the same parent sample (linked with the field **parentSampleID**) are published in Surface water microbe community composition (DP1.20141.001) - discontinued, Surface water microbe group abundances (DP1.20278.001) - discontinued, Surface water microbe marker gene sequences (DP1.20282.001), Surface water microbe metagenome sequences (DP1.20281.001), and Surface water microbe community taxonomy (DP1.20141.002).

## Product Instances

At each stream site, there will be 12 samples collected per year. At a lake or river site, there will be a minimum of 6 samples and a maximum of 9 samples collected per year (the maximum applies if the water column is stratified). Each sample generates one sequencing record at the external lab.

## Data Relationships

A record in amc_fieldGenetic must have a corresponding record in amc_fieldSuperParent describing measurement depth and abiotic variables during sample collection. Each record in amc_fieldGenetic may be linked to a record in the downstream laboratory tables, which contain data from the external laboratory.
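The link between the two field tables can be checked before any join. The block below is a minimal sketch in Python, assuming each table has already been stacked and read into a list of dictionaries (for example with `csv.DictReader`) and that **parentSampleID** is the linking field, as described in this section. The function name `check_parent_links` and the sample IDs are hypothetical and are not part of any NEON package.

```python
from collections import Counter


def check_parent_links(super_parent_rows, genetic_rows, key="parentSampleID"):
    """Report repeated and unmatched parent sample IDs before joining tables.

    Both arguments are lists of dicts, e.g. rows read with csv.DictReader
    from stacked amc_fieldSuperParent and amc_fieldGenetic files.
    """
    parent_counts = Counter(row[key] for row in super_parent_rows)
    genetic_counts = Counter(row[key] for row in genetic_rows)
    return {
        # IDs that appear more than once in a table; worth inspecting by hand
        "duplicate_parents": sorted(k for k, n in parent_counts.items() if n > 1),
        "duplicate_genetic": sorted(k for k, n in genetic_counts.items() if n > 1),
        # genetic records with no corresponding super-parent record
        "orphan_genetic": sorted(k for k in genetic_counts if k not in parent_counts),
    }


# Made-up rows for illustration only; real IDs come from the downloaded data.
super_parent = [
    {"parentSampleID": "A"},
    {"parentSampleID": "B"},
    {"parentSampleID": "B"},
]
genetic = [{"parentSampleID": "A"}, {"parentSampleID": "C"}]

report = check_parent_links(super_parent, genetic)
print(report)
```

An empty list under every key suggests the tables can be joined on **parentSampleID** without dropping or multiplying records. A repeated ID is not necessarily an error, so flagged records should be reviewed rather than removed automatically.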
Duplicates and/or missing data may exist where protocol and/or data entry aberrations have occurred; users should check data carefully for anomalies before joining tables.

amc_fieldSuperParent.csv -> One record is created for each field microbe sample that comes from a surface water sample.

amc_fieldGenetic.csv -> One record is created by field personnel for each field genetic sample. Each field record has a corresponding amc_fieldSuperParent **parentSampleID**. The fields **geneticSampleID**, **metagenomicSampleID**, and **fieldBlankGeneticSampleID** are created here and are used to track samples through to the external lab. The field **archiveSampleID** is also created here and is used to track archived samples sent to the NEON Biorepository.

Data downloaded from the NEON Data Portal are provided in separate data files for each site and month requested. The neonUtilities package in R and the neonutilities package in Python contain functions to merge these files across sites and months into a single file for each table. The neonUtilities R package is available from the Comprehensive R Archive Network (CRAN; https://cran.r-project.org/web/packages/neonUtilities/index.html) and can be installed using the install.packages() function in R. The neonutilities package in Python is available on the Python Package Index (PyPI; https://pypi.org/project/neonutilities/) and can be installed using pip. For instructions on using the package in either language to merge NEON data files, see the Download and Explore NEON Data tutorial on the NEON website: https://www.neonscience.org/download-explore-neon-data.

## Special Considerations

Note that this data product user guide details the field collection for surface water microbes.
Sequencing data are found in the following data products:

* Surface water microbe marker gene sequences (DP1.20282.001)
* Surface water microbe metagenome sequences (DP1.20281.001)
* Surface water microbe community taxonomy (DP1.20141.002)
* Surface water microbe community composition (DP1.20141.001) - discontinued
* Surface water microbe group abundances (DP1.20278.001) - discontinued

# DATA QUALITY

## Data Entry Constraint and Validation

Many quality control measures are implemented at the point of data entry within a mobile data entry application or web user interface (UI). For example, data formats are constrained and data values controlled through the provision of drop-down options, which reduces the number of processing steps necessary to prepare the raw data for publication. The data entry workflow for collecting surface water microbe data as part of the water sampling is diagrammed in \autoref{fig:app}. An additional set of constraints is implemented during the process of ingest into the NEON database. The product-specific data constraint and validation requirements built into data entry applications and database ingest are described in the document `r ingestName`, provided with every download of this data product. Contained within this file is a field named 'entryValidationRulesForm', which describes syntactically the validation rules for each field built into the data entry application. Data entry constraints are described in NICL syntax in the validation file provided with every data download, and the NICL language is described in NEON's Ingest Conversion Language (NICL) specifications (AD[14]).

\begin{sidewaysfigure}
\centering
\includegraphics[width=22cm]{`r paste(myPathToGraphics,'SWC_DGA_AMC_app_UPDATED.jpg', sep='/')`}
\caption{Schematic of the applications used by field technicians to enter water chemistry and surface water microbe field data}
\label{fig:app}
\end{sidewaysfigure}

## Automated Data Processing Steps

Following data entry into a mobile application or web user interface, the steps used to process the data through to publication on the NEON Data Portal are detailed in the NEON Algorithm Theoretical Basis Document: OS Generic Transitions (AD[12]).

Published data are reviewed for completeness, timeliness, and validity using an internal set of tests and metrics, as detailed in the NEON Algorithm Theoretical Basis Document: OS Data Quality Control (AD[16]). These quality tests are used to guide process improvements, audits of analytical facilities, and data updates, but do not generate quality flags in published data.

## Data Revision

All data are provisional until a numbered version is released. Annually, NEON releases a static version of all or almost all data products, annotated with digital object identifiers (DOIs). The first data Release was made in 2021. During the provisional period, QA/QC is an active process, as opposed to a discrete activity performed once, and records are updated on a rolling basis as a result of scheduled tests or feedback from data users. The Issue Log section of the data product landing page contains a history of major known errors and revisions.

## Quality Flagging

The **dataQF** field in each data record is a quality flag for known errors applying to the record. Please see the table below for an explanation of **dataQF** codes specific to this product.

\begin{table}[H]
\centering
\caption{Descriptions of the dataQF codes for quality flagging}
\label{tab:offset}
\begin{tabular}{|m{1.6cm}|m{3.2cm}| m{8.9cm}|}
\hline
\bfseries{fieldName}& \bfseries{value} & \bfseries{definition} \\ \hline
dataQF &legacyData & Data recorded using a paper-based workflow that did not implement the full suite of quality control features associated with the interactive digital workflow \\ \hline
dataQF &incorrect 0.45 um pore size Sterivex filter may have been used, data may be skewed toward larger organisms & The incorrect Sterivex filter size (0.45 um instead of 0.22 um) may have been used for sampling but could not be determined definitively \\ \hline
\end{tabular}
\end{table}

Records of land management activities, disturbances, and other incidents of ecological note that may have a potential impact are found in the Site Management and Event Reporting data product (DP1.10111.001).

## Analytical Facility Data Quality

Data analyses conducted on aquatic microbe external lab data are captured in the data product tables for Surface water microbe marker gene sequences (DP1.20282.001), Surface water microbe metagenome sequences (DP1.20281.001), and Surface water microbe community taxonomy (DP1.20141.002). Data products Surface water microbe community composition (DP1.20141.001) and Surface water microbe group abundances (DP1.20278.001) have been discontinued.
Information on sample collection methods such as frequencies per sample type can be found in the field user guides for each data product:

- NEON User Guide to Microbe Marker Gene Sequences (DP1.10108.001; DP1.20280.001; DP1.20282.001)
- NEON User Guide to Microbial Metagenome Sequences (DP1.10107.001; DP1.20279.001; DP1.20281.001)
- NEON User Guide to Microbial Community Taxonomy (DP1.10081.002; DP1.20141.002; DP1.20086.002)
- NEON User Guide to Microbial Community Composition (DP1.10081.001; DP1.20141.001; DP1.20086.001) - data product discontinued
- NEON User Guide to Microbe Group Abundances (DP1.10109.001; DP1.20277.001; DP1.20278.001) - data product discontinued

# REFERENCES

Boulos, L., M. Prevost, B. Barbeau, J. Coallier, and R. Desjardins. 1999. LIVE/DEAD\textregistered \textit{Bac}Light\textsuperscript{TM}: application of a new rapid staining method for direct enumeration of viable and total bacteria in drinking water. Journal of Microbiological Methods 37: 77-86.