GeneGrid enables you to quickly interpret SNVs and small indel variants from human sequencing data. Variants from targeted panels, exome and whole genome sequencing are annotated using a variety of annotation sources. You can filter the annotated variants for each sample individually, perform trio analyses or compare case and control sets using multiple samples.
Some terms appear frequently in the user interface and are used throughout this manual. Here are the most important ones:
A VCF file may contain multiple samples in the additional FORMAT columns, each with a sample identifier in the header line. In GeneGrid each sample column translates into a unique sample and the name from the header is assigned to it. This also holds if the same file with the same sample names is imported several times: each sample is kept individually and previously uploaded samples are never overwritten.
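As an illustration of how the header columns map to samples, the sketch below (a hypothetical helper, not part of GeneGrid) extracts the sample names from the #CHROM header line; the NA12878 trio names are placeholders:

```python
def vcf_sample_names(header_line):
    """Return the sample names from the #CHROM header line of a VCF file.

    Columns 1-9 are fixed (CHROM .. FORMAT); every column from the
    10th onward names one sample, each of which GeneGrid imports as
    its own unique sample.
    """
    fields = header_line.rstrip("\n").split("\t")
    return fields[9:]  # sample columns start at column 10 (index 9)

header = ("#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT"
          "\tNA12878\tNA12891\tNA12892")
print(vcf_sample_names(header))  # → ['NA12878', 'NA12891', 'NA12892']
```

Importing this file would create three separate samples named after the three header entries.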
The genotypes of samples can be compared with each other. This is a powerful way to study disease or inheritance patterns. A Trio study is one example of such a comparison analysis. A VCF file may already contain such a comparison based on several samples; however, during the import step each sample is stored separately in GeneGrid. A comparison can be created afterwards through the sample comparison task.
All kinds of output data produced by an analysis step are summarized under the notion of a result. Sample and comparison are two examples of such a result: a sample is the output of the variant annotation, whereas a comparison is the result of the sample comparison task.
GeneGrid helps you annotate and interpret variants. Variants are listed for each sample and for each comparison and are the central type of result rows available in GeneGrid. Each genomic variant can affect multiple genes and have very distinct effects. Within GeneGrid, variants are analyzed separately for each gene context, so a single genomic variant in the VCF file can result in multiple variants in GeneGrid.
The first page displayed after a user signs in is the overview page. It serves as a central entry point leading to the four tasks: variant annotation, sample comparison, Genome Browser and Pathway Analysis. Additionally, the latest product updates and important announcements can be found at the bottom of this page.
This page lists all the samples that are available in GeneGrid. It is accessible through the menu or through the overview page.
Variants are shown in a table-based view where each row contains a single variant. Sample and comparison both contain variants as analysis results. Such result pages can be viewed from the samples page and the result management.
The result management is the central place to manage results, which can be either samples or comparisons. A result can be deleted, or its name and comment can be edited. Additionally, the page shows the current storage state of each result and how long it will be kept in the system.
GeneGrid follows a simple pricing model:
Cost directly translates to the number of results that are generated and how long each result is stored on the server; for details see Pricing. A result is either a sample that has been imported and annotated or a comparison analysis that has been created based on samples. This section outlines the different states such a result can pass through over time. This is important because the storage and availability of the result are affected by its current state.
When an analysis finishes and the result is created, it immediately becomes visible on the result management page. In this initial state the result enters a grace period and no cost is incurred. This grace period lasts 7 days, and the remaining days for a result can be seen in the result management. During the grace period all data belonging to that result are stored in the system, and the result can be activated at any time. If the result is not activated during the grace period it will be removed automatically without any further notice to the user.
By activating a result the user first purchases the result with credits available in the account; the user only pays for the results that are activated. After activation the result is available for the next 30 days and can be used during that period for further analysis steps. After this period a storage fee is automatically charged and the result continues to be available for another 30 days. This repeats until the user deletes the result or the account credits are no longer sufficient for the storage fee.
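The recurring 30-day cycle described above can be sketched as a small calculation. This is an illustration only, assuming fees are charged exactly at each 30-day boundary after activation; the authoritative billing details are on the Pricing page:

```python
def storage_fees_charged(days_since_activation):
    """Number of storage fees charged so far for an activated result.

    The first 30 days are covered by the activation purchase itself;
    each completed 30-day period afterwards incurs one storage fee
    (illustrative assumption: fees fall exactly on the period boundary).
    """
    return days_since_activation // 30

print(storage_fees_charged(10))  # → 0  (still in the first 30-day period)
print(storage_fees_charged(30))  # → 1  (first storage fee due)
print(storage_fees_charged(75))  # → 2
```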
The main pages for analyzing variants share the same structure to make it easier to get used to the workflow of the tasks. A page generally consists of:
Starting at the top with the header, the most important information regarding the current location path, navigation elements and user settings is situated here. From left to right the path is displayed, with the steps separated by the > (greater than) sign. For example, the page you are viewing right now contains the text: Documentation > User Manual. This is also where you can find the name of your sample or the title of your comparison analysis when you are looking at a variant page. On the right side of the header the main menu, help menu and user menu are arranged in that order. The main menu always stays the same and contains the links to reach the overview page, samples page, Genome Browser, Pathway Analysis and the result management. A basic summary of the result data is located at the bottom right in the footer. This includes the total number of rows, the number of rows that remain after the currently active filter setting and the number of rows that are currently viewed and retrieved from the database.
In contrast, the inner content changes according to the type of data that is currently shown. Moreover, it is divided into three sub panels:
1. The sidebar panel on the left provides a list of functions and tasks related to the result. Filtering result data is a task that all result views have in common. Other tasks and functions vary among the different views and depend on the type of data that is shown.
2. The data panel is the main table that lists the output data in rows. The total number of available rows for the current view can be seen in the footer at the bottom right. By default, only a limited number of the most commonly used columns are visible. Tables have a settings button in the leftmost column heading which allows you to show or hide the optional columns of the table.
3. By clicking on one of the rows in the table the details panel is opened, and background data are loaded and displayed in an extended view for the selected result row. It contains several smaller tables structured by source and type of annotation. Detailed information is referenced by a subset of columns from the data table. For instance, on the variants page the reference is either the gene or the genomic position of the selected row; in this case diseases are listed based on the gene of the selected variant.
The central task for all kinds of results is filtering the result rows. Filtering is available in two different flavors:
The quick filter is accessible through the column headings of the table. An input field for text or a select box can be used to quickly narrow down the result rows. The quick filter settings of all columns are combined (all conditions have to be true): a row has to pass each individual filter definition in order to be shown. Each column defaults to the most appropriate comparison operator: the coverage column uses greater or equal, whereas the allele frequency columns use less or equal.
Sometimes it is necessary to search using comparison operators other than the default ones, or to apply multiple filters to the same column at the same time. For this purpose the advanced filter tool can be used. It is more flexible in terms of functionality and allowed filter complexity; however, it is not accessible from the table directly and therefore requires some additional steps. The advanced filter tool can be found in the sidebar panel on the left side and is the very first task on top of the tool list. Here, a filter is defined by adding a filter condition and filling the three input fields:
The availability of operators depends on the type of data contained in the column: textual columns have operators like contains, and numeric columns have operators like greater or less. To return to the initial view where all result rows are shown, the filter can be reset at any time from the advanced filter tool.
Although it is generally possible to switch between the quick and advanced filter, some technical limitations apply. Since the quick filter cannot express filters as complex as those created with the advanced filter tool, the quick filter is temporarily disabled when the filter rules can no longer be displayed accurately. This happens, for instance, when the comparison operator was changed from the default one or when several conditions apply to the same column. The advanced filter is not affected and is always accessible.
Each result row contains several columns filled with information related to that row. This is usually sufficient for browsing through the results. However, there is often much more background information and annotation data available. This type of data is collected in the details panel and can be retrieved by clicking on a result row. Tabs are used to structure the information into detail groups based on annotation source and type. If there is no content available for one of the detail groups, the tab is grayed out to indicate absent data. Otherwise the number of rows for this group is given right next to the name in the tab.
Variants for samples are imported from a VCF (Variant Call Format) file, optionally compressed with gzip (file name ends with vcf.gz). Version 4.1 and Version 4.2 of the VCF format specification are supported. The VCF file format accommodates many different needs to store a variety of data about the observed variation in a genome. A valid VCF file can contain zero or more samples stored in the genotype or sample columns starting with column 10. The preceding column 9 is the FORMAT column and defines the sample-specific fields. In order to successfully annotate the variants, at least one sample needs to be defined, i.e. the file has at least 10 columns.
The following fields are required for the annotation:
All other fields are optional and need not be filled for correct annotation. However, if present in the file, the following fields will additionally be imported and made available for filtering:
Some variant calling pipelines store the allele-specific coverage in addition to the total read depth. These data fields are not yet as standardized as the other fields mentioned above. If any of these fields is present it will become available under ref coverage and alt coverage. The value in alt coverage is the sum of all depths across the listed alternative alleles. The fields for the sample-specific allele coverage are:
A common problem when comparing the genotypes of samples is incomplete information for genomic positions. This usually occurs when the variants for the samples are called separately: if a position is not listed in the VCF file, it is practically impossible to decide whether that position is reference or no-call. The Genome VCF (gVCF) is a set of conventions to overcome this limitation. In a gVCF file the non-variant regions are listed in addition to the rows of a traditional variant file. This enables a downstream analysis tool to categorize each genomic position as either variant, reference or no-call. The distinction between reference and no-call is made through filter criteria set during gVCF creation. The filter criteria not only allow recognizing that a position was a no-call, but also show due to which criterion (e.g. low coverage, low genotype quality score, conflicting predictions, or any other quality control setting) this decision was made.
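This categorization can be illustrated with a small sketch. It assumes the common gVCF conventions that non-variant blocks use the symbolic allele <NON_REF> (or ".") in the ALT column and that a failing FILTER value such as LowGQX marks a no-call; the exact allele and filter labels depend on the variant calling pipeline:

```python
def classify_gvcf_row(alt, filter_field):
    """Categorize one gVCF row as variant, reference or no-call (sketch).

    A row with a real alternative allele is a variant; a non-variant
    block that passed all filters is reference; a non-variant block
    with a failing filter value is a no-call, and FILTER records why
    (e.g. LowGQX for a low genotype quality score).
    """
    non_variant = alt in (".", "<NON_REF>")
    if not non_variant:
        return "variant"
    return "reference" if filter_field in ("PASS", ".") else "no-call"

print(classify_gvcf_row("A", "PASS"))            # → variant
print(classify_gvcf_row("<NON_REF>", "PASS"))    # → reference
print(classify_gvcf_row("<NON_REF>", "LowGQX"))  # → no-call
```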
GeneGrid supports gVCF files as input from the following variant calling pipelines:
For further information on the generation of gVCF files and their usage, see: gVCF Conventions
The comparison analysis takes advantage of this additional information provided in a gVCF file. The interpretation of the supplied filter criteria follows a simple rule set:
To start the import process, first click on Variant Annotation on the overview page. On the next page, the section Import samples & annotate variants appears on the left side. Click on choose file/browse and select a VCF file from your local file system. The import process can take up to several hours depending on the number of variants in your input file. You will be notified automatically by email as soon as the import and annotation process has finished and the samples are available.
Two optional pre-filters are available and can be used to restrict the import process to variants with prior conditions:
With the exome filter enabled, only variants in the coding sequence or overlapping a splice site or splice region of coding transcripts are considered for annotation and imported into GeneGrid; other variants are skipped. Likewise, with the minimum coverage filter all variants that have less than the specified coverage value are skipped. The default value for the coverage filter is 1, which means variants with a coverage value of zero are not imported. The coverage filter can only be applied if the input VCF file contains the sample-specific read depth (DP field in the genotype columns). This filter can be fully deactivated by setting the value to 0.
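The minimum coverage pre-filter logic can be sketched as follows. This is an illustration only; in particular, keeping a variant when DP is absent is an assumption, since the manual states only that the filter requires the DP field to be present:

```python
def passes_coverage_prefilter(dp, min_coverage=1):
    """Decide whether a variant survives the minimum coverage pre-filter.

    A min_coverage of 0 deactivates the filter entirely.  The default
    of 1 skips variants whose sample-specific read depth (DP) is zero.
    If DP is absent the filter cannot be applied and the variant is
    kept (assumption made for this sketch).
    """
    if min_coverage == 0 or dp is None:
        return True
    return dp >= min_coverage

print(passes_coverage_prefilter(0))                  # → False (coverage zero)
print(passes_coverage_prefilter(25))                 # → True
print(passes_coverage_prefilter(0, min_coverage=0))  # → True (filter off)
```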
One advantage of the pre-filters is a small speed-up at the import step because some variants are skipped and the total number of variants is therefore reduced. Fewer imported variants may also reduce the cost and allow for quicker comparisons later on. The value has to be chosen carefully if the default is not the preferred one, because by skipping variants from your data set you might lose important information later on.
When a sample is imported into GeneGrid the SNVs and small indels (insertions and deletions) are automatically annotated. The annotation includes the transcript annotation, the existence of variants in public variant databases, the predicted effects and known allele frequencies in populations. GeneGrid integrates a variety of existing annotation sources including many public and proprietary databases, for example:
A complete list of all the sources together with version information can be found at Data Sources.
The single most influential source is the transcript and gene annotation. The selected transcript set directly determines the outcome of the effect prediction for the coding sequence. GeneGrid uses a comprehensive transcript set based on RefSeq and Ensembl transcripts.
For the GRCh37 build, the transcript set is frozen to the following releases:
Transcript sets for GRCh38 build will be continuously updated to the latest RefSeq and Ensembl releases. The annotation will be maintained for both genome builds GRCh37 and GRCh38.
Every month an increasing number of new articles is published about genetic variants that are reported to have an impact on phenotypes. Abstracts of biomedical literature (PubMed) are analyzed through Genomatix literature mining (LitInspector). Several elements are extracted from the abstracts:
First, this information is used to annotate the variants at the genomic position level. Through the column literature citations this provides quick access to the latest literature that mentions a genetic variation at that position, which might assist in variant interpretation. Second, gene-to-disease associations are derived based on the number of articles that reference both terms. These associations are ranked and listed in literature diseases and literature tissues.
Variant annotation always has to be seen as a snapshot of the existing annotation data at a given time point, which provides context to the variants. Sources used for the annotation change over time and may significantly improve by correcting previously erroneous or conflicting entries or simply by adding new data. Not only does the underlying transcript set get updated from time to time, but known variants in databases like dbSNP and ClinVar are also constantly changing. Many publications are added each month to the biomedical literature that is used in our automated literature mining. In the context of variant interpretation it is important to apply the latest available annotation data for the assessment of pathogenicity. Additionally, a reanalysis of previously reported variants can be of advantage when additional data or prediction models support a different assessment or provide additional layers of supporting evidence.
With the concept of continuous annotation, the variant annotation of imported samples is regularly updated to the latest releases of major annotation data sources or when new data sources are added.
How does continuous annotation improve the interpretation of my samples?
An email notification is sent when the import process is finished. If all steps succeeded, the email contains a link to the list of available samples. The available samples are also accessible through the samples page from the overview page or the main menu. An input file can have multiple samples; each sample is reported separately in the list. Example: a single input file with 3 samples (sample columns 10-12 following the FORMAT column 9 in the VCF file) will produce a total of 3 samples as result.
As a recommendation, the first step should be to check the details for a sample. The details panel can be opened by simply clicking on a row in the samples table. Variants are counted in GeneGrid in two different ways:
It is expected that both numbers differ. The difference arises because a variant may affect multiple genes at the same genomic position. In that case the variant is duplicated and stored multiple times: one entry for each affected gene. This is necessary because the transcript effect may differ significantly among transcripts and genes, so the effect is reported individually at the cost of some additional result rows. This makes it easier to filter the variants while keeping track of the affected gene. To summarize, the number of variants in the database includes duplicated variants, whereas the number of variants in the source represents the entries in the source file.
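The duplication can be illustrated with a small sketch (the variant coordinates and gene names are hypothetical examples, not taken from a real data set):

```python
def expand_by_gene(variant, overlapping_genes):
    """Duplicate one genomic variant into one result row per affected gene.

    GeneGrid stores one row per gene context so the transcript effect
    can be reported individually for each gene (sketch of the idea).
    """
    return [dict(variant, gene=gene) for gene in overlapping_genes]

variant = {"chrom": "2", "pos": 179400000, "ref": "C", "alt": "T"}
rows = expand_by_gene(variant, ["TTN", "TTN-AS1"])
print(len(rows))        # → 2 database rows for 1 source entry
print(rows[0]["gene"])  # → TTN
```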
There are a few reasons why a variant cannot be imported properly, e.g. the genomic position was not found on the genome (due to a genome build mismatch) or the variant entry for a sample did not provide any genotype information. Please remember that genotype information is required for an analysis in GeneGrid. A warning is shown if such an event occurs during the import step. Additionally, if a variant entry in the source file could not be processed at all, the number of skipped variants is shown as well. The first 100 warnings can be seen in the details panel. A recommended way to assess the success of the import and annotation step is to look at the skipped variants, warnings and the number of total variants.
Each imported sample needs to be activated before it is possible to look at the actual result rows. Activation is easy and can be initiated by simply clicking the activate button on the left side of the sample row.
The samples page provides two additional details tabs for each sample:
The quality control panel gives a brief summary of the data quality of the variants in the file. This includes metrics like the percentage of SNVs that were not found in dbSNP (also known as the dbSNP novelty rate), coverage and genotype quality summaries, and others.
The annotation distribution panel gives a summary about the annotation process. It lists the number of variants that were not considered for annotation due to unknown genotype or reference sequence mismatches.
Both details panels can be accessed even if the sample has not been activated. The quality control serves as an initial check to decide whether the variant calling step met certain standards before continuing with variant interpretation. The annotation distribution, in contrast, serves as a checkpoint to see whether the format of the variants was detected and the variants could be annotated as expected. The statistics are calculated both genome-wide and per chromosome.
After the variants have been annotated and stored in the database, the data can now be analyzed and interpreted. Remember that if the input file contained multiple samples, each sample is stored individually in GeneGrid. If you are not interested in comparing samples at this point, the next chapter discusses how to filter variants.
The purpose of a comparison study is to facilitate the interpretation task by integrating additional data sources for example from samples that are not affected or samples that represent some kind of replicates. GeneGrid supports three types of comparison studies:
In a Trio study an affected offspring is compared to the two parents as control set. Trio studies have two additional columns that are not present in other comparison studies:
Both columns are of special interest to characterize the source of variation in the affected offspring. The offspring inheritance summarizes the observed relative genotype in relation to the parents which can be: reference, heterozygous, homozygous, hemizygous and de novo. The second column indicates that the variant is part of a possible candidate set for compound heterozygosity.
The Cancer study can be applied to discover putative somatic mutations. This study always requires pairs of samples: there have to be the same number of samples in the case and control groups in order to perform such an analysis. The samples are then compared pairwise, usually a sample combined with a matched normal. In other words, the first sample in the affected group is compared to the first sample in the non-affected group, the second affected sample to the second non-affected sample, and so on. This study adds an additional column: a flag that indicates whether one of the pairs contains a somatic mutation. Here, a variant is considered a putative somatic mutation using a naive approach in which the genotypes of the two paired samples must differ. Pairs in which a sample has no genotype information at the compared position (no call) are skipped for somatic detection, because no meaningful assumption can be made about the genotype in this case.
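The naive pairwise rule can be sketched as follows. This is an illustration of the described behavior, not GeneGrid's implementation; genotypes are plain VCF-style GT strings and None stands for a no-call:

```python
def putative_somatic(case_gt, control_gt):
    """Naive somatic flag for one case/control sample pair (sketch).

    A variant is flagged as a putative somatic mutation when the two
    paired genotypes differ.  Pairs where either sample has no
    genotype call (None) are skipped, because no meaningful
    assumption can be made in that case.
    """
    if case_gt is None or control_gt is None:
        return False  # no-call: skipped for somatic detection
    return case_gt != control_gt

print(putative_somatic("0/1", "0/0"))  # → True  (tumor differs from normal)
print(putative_somatic("0/1", "0/1"))  # → False (same genotype)
print(putative_somatic("0/1", None))   # → False (no-call skipped)
```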
Everything else that does not fit into one of the two previously described study types can be analyzed as an other study. In this case it is still possible to get valuable information on how many samples differ between the two groups:
Samples can only be selected together if a minimum set of common criteria is met:
A comparison study can be started through the task named sample comparison, either from the overview page or the main menu. The page that lists all available samples is shown, and the section for the comparison task is opened automatically in the sidebar. First, the type of study has to be selected in order to continue. In the second step the samples need to be assigned to one of the two groups: case and control. Samples in the case group are investigated under the prior phenotype assumption that they are affected; in contrast, the control group is intended for non-affected samples. A sample can be assigned by dragging a result row from the table and dropping it into one of the target groups. A yellow border appears as a visual cue that a sample can be dropped. Please be aware that you can only use samples that have been activated previously. After entering a study name the comparison analysis can be started.
Depending on the number of variants in the samples, the comparison analysis may take up to an hour. The status of the comparison is shown while the data are being processed. A finished analysis can be also accessed through the result management.
In general, filtering variants of a comparison result works in the same way as filtering variants of a single sample. The only difference is that some additional columns become available for a comparison, while other columns that are very specific to the sample are removed or summarized. In this case the full range of columns can still be seen in the details panel, including all the available columns for the sample. For an introduction and a specific use case on how filtering can be applied to find and interpret disease-causing variants, you might consider walking through the Tutorial.
It is possible to define a combination of filter rules for the same column. For instance, this allows searching at the same time for up to 10 genes or disease terms in a single filter setting.
Here is an example with multiple genes:
The filter rules are created by simply selecting a column from the add column select box. In this example, the gene column was added three times and the genes NMNAT1, RBM20 and TTN were defined as search criteria. Additionally, the coverage column was added twice and a range between 30 and 400 reads was specified. The filter is executed with the search button. All variants are shown that were annotated to NMNAT1 OR RBM20 OR TTN, while the coverage has to be greater than or equal to 30 AND less than or equal to 400.
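One plausible way to model the combination rules of this example is sketched below: repeated equality conditions on the same column are OR'd, while numeric comparison conditions are AND'd. This is an illustration consistent with the described behavior, not GeneGrid's actual filter engine:

```python
def matches_filter(row, conditions):
    """Evaluate advanced-filter conditions on one result row (sketch).

    Repeated equality conditions on the same column are combined with
    OR (e.g. several genes), while numeric comparison conditions are
    combined with AND (e.g. a coverage range).
    """
    allowed_values = {}   # column -> list of OR'd equality values
    numeric = []          # (column, operator, value), AND'd
    for column, op, value in conditions:
        if op == "equals":
            allowed_values.setdefault(column, []).append(value)
        else:
            numeric.append((column, op, value))
    for column, values in allowed_values.items():
        if row[column] not in values:
            return False
    for column, op, value in numeric:
        if op == ">=" and not row[column] >= value:
            return False
        if op == "<=" and not row[column] <= value:
            return False
    return True

conditions = [
    ("gene", "equals", "NMNAT1"),
    ("gene", "equals", "RBM20"),
    ("gene", "equals", "TTN"),
    ("coverage", ">=", 30),
    ("coverage", "<=", 400),
]
print(matches_filter({"gene": "TTN", "coverage": 120}, conditions))    # → True
print(matches_filter({"gene": "BRCA1", "coverage": 120}, conditions))  # → False
print(matches_filter({"gene": "TTN", "coverage": 10}, conditions))     # → False
```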
The functionality of the exome pre-filter is also available after the import step. This filter flag allows reducing the list of variants to those in the coding sequence or overlapping a splice site or splice region of coding transcripts.
The deleterious column represents a consolidated flag which predicts that the variant might affect the splicing of the transcript or alter the amino acid sequence of a protein:
The value itself is a distilled version of the predicted effects column. The deleterious flag is set only if the genotype for the sample is not a reference call. It enables you to quickly filter out variants that are synonymous and consider only variants that may be deleterious at the molecular level. This does not necessarily mean that the function of the gene is negatively affected or that the variant is deleterious at the organism level.
Often the number of variants reported to have an impact on the protein at the molecular level (deleterious) is still too large. An additional column called low confidence allows removing those variants that might have a higher uncertainty associated with them. The criteria for low confidence are:
Genes have multiple transcripts with different products, and therefore different protein changes arise from a variant. The consensus variation effect at the amino acid level is determined by a majority vote of the underlying transcript effects:
* First, low confidence transcripts are excluded
* A majority vote is performed on the remaining transcripts
* In case of a tie, the effect of the longest transcript is selected as consensus effect
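The steps above can be sketched as follows (the transcript records, effects and lengths are hypothetical examples):

```python
from collections import Counter

def consensus_effect(transcripts):
    """Majority vote over transcript effects (sketch of the steps above).

    1. low confidence transcripts are excluded
    2. a majority vote is performed on the remaining effects
    3. a tie is broken by the effect of the longest transcript
    """
    usable = [t for t in transcripts if not t["low_confidence"]]
    if not usable:
        return None
    counts = Counter(t["effect"] for t in usable)
    best = max(counts.values())
    tied = {effect for effect, n in counts.items() if n == best}
    if len(tied) == 1:
        return tied.pop()
    longest = max((t for t in usable if t["effect"] in tied),
                  key=lambda t: t["length"])
    return longest["effect"]

transcripts = [
    {"effect": "missense",   "length": 2100, "low_confidence": False},
    {"effect": "synonymous", "length": 3400, "low_confidence": False},
    {"effect": "missense",   "length": 1800, "low_confidence": True},
]
# After excluding the low confidence transcript the vote is tied,
# so the longest remaining transcript (3400) decides.
print(consensus_effect(transcripts))  # → synonymous
```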
This enables you to quickly filter for genomic positions where ClinVar has any information available. The significance summary is given in a basic color coding scheme for pathogenicity annotation from the ClinVar submitters. It is important to know that this information is shown solely based on the genomic position and does not necessarily reflect the actual variant allele. Therefore it is up to the user to compare it to the observed variant alleles in order to draw a conclusion about the variant itself. This can be done, for instance, by using the filter column for effect prediction.
Here is an example of the use of the clinical significance filter:
The filled red boxes show that at these specific positions all submitted variants in ClinVar were annotated as pathogenic. If a box is only half-filled, the submitters showed discordance and there was at least one who annotated a variant at this position with unknown or other significance.
Another helpful column for assessing a gene-to-disease association was added based on available diagnostic tests by molecular labs. The source for this annotation is the GTR (Genetic Testing Registry). It allows you to see which diseases were associated with specific genes by the molecular lab that developed the test. Usually such associations comprise diseases that are known to have a genetic cause. It is important to note that this association is at gene level and does not necessarily mean that any observed variant is associated with one of the listed diseases.
In this example it can be seen that 5 diagnostic tests are available that associate NMNAT1 with diseases like Leber congenital amaurosis and cone-rod dystrophy.
Non-coding variants are annotated by overlapping variant regions with predicted regulatory features like enhancers, promoters and promoter flanks. These regulatory features are defined in the Ensembl Regulatory Build based on segmentation algorithms. In addition, the underlying experimental evidence for such regions is extracted from ChIP-Seq peaks, DNaseI hypersensitive sites and histone modification experiments. Genomatix' MatInspector is applied to find matching motifs within the ChIP-Seq peaks. A combined column for the number of regulatory evidences with 4 classes helps to quickly filter down a variant set without using the evidence columns individually:
Each experimental evidence layer counts towards the class assignment only if the tissue matches among the experiments.
Experimental details about the transcription factor binding sites with optional motif match and overlap with hypersensitive sites or histone modifications are available in the details panel for each region:
The Genome Browser allows to browse the human genome in context of your variants of interest and explore publicly available data. Each variant row in the data table has a link called Browse which leads directly to the Genome Browser view. The link appears in the leftmost column either by placing the mouse cursor over the row or after selecting the row with a click. Read more.
In addition, one BAM file can be uploaded for every annotated and activated sample. The samples page, which lists all available samples, has a sidebar block to upload BAM files of up to 50 GB. The BAM file can be used to view alignments at the read and coverage level in the Genome Browser.
In order to quickly find the uploaded BAM files again, each of them needs to be associated with a sample that has already been activated. This association is specified before the upload, in the same way as samples are assigned to one of the groups for comparison analyses. A column associated alignments indicates whether a sample has an assigned BAM file. If the same sample is assigned again to a new BAM file, the old association is automatically cleared and replaced with the new one; the previous BAM file is then automatically deleted from the server as well.
Please note, that a BAM file is deleted without further notice in the following events:
GeneGrid also assists with a gene enrichment analysis after variant filtering. Every filter step returns a list of variants, and these variants can be mapped to genes. By default the genes are grouped and ordered automatically in the background, starting with the genes that contain the most variants. This list of the 10,000 most frequently occurring genes can be passed to the Pathway Analysis (GePS). The sidebar provides this functionality in the section labeled Pathway Analysis. Before submitting the request it is possible to specify the list of categories for the enrichment. Read more.