Single-Nucleotide Polymorphisms (SNPs)

Understanding Your DNA Raw Data

All of the major genealogical companies allow you to download your DNA raw data so you can analyze it or upload it to other companies. If you open the raw data file, you will see many rows of letters and numbers that look complicated at first. The goal of this wiki page is to explain what all of that information means.

General Format

A raw data file will typically have four or five columns. The first column records the "rsid." This is a code that the DNA company uses to record which SNP is being looked at. Each SNP has a unique code that is always rs followed by a number. The second column will have a number from 1 to 26. This is the number of the chromosome that the SNP is located on. The X chromosome can be referred to as X or 23, the Y chromosome as Y or 24, the pseudo-autosomal region on the Y chromosome as PAR or 25, and the mitochondrial DNA as MT or 26. The third column will be a number that records where on the chromosome the SNP is located. The first nitrogenous base on a chromosome is always 1, the next is 2, the next is 3, and so on, in order, up into the millions. The higher the number, the farther along the chromosome the SNP is located. Since not all base pairs are tested, you may notice large jumps in the numbers, but the numbers always increase until the end of the chromosome is reached. The fourth and fifth columns record the two values you have at each SNP. Some companies put both values into one column (four column template) and some give each value its own column (five column template).
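The column layout described above can be sketched in Python. This is only an illustration under stated assumptions: that fields are separated by tabs or commas, that comment lines start with `#`, and that a row whose third field is not a number is a header. Real files differ between companies. The names `SnpRecord` and `parse_raw_line` are hypothetical, and the rsids in the usage example are made up.

```python
from typing import NamedTuple, Optional


class SnpRecord(NamedTuple):
    rsid: str        # column 1: SNP identifier, e.g. "rs" plus a number
    chromosome: str  # column 2: chromosome number or letter code
    position: int    # column 3: location of the SNP on the chromosome
    allele1: str     # columns 4 and 5 (or the two letters of column 4)
    allele2: str


def parse_raw_line(line: str) -> Optional[SnpRecord]:
    """Parse one row; return None for comments, headers, blanks, or unknown layouts."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    # Assumption: fields are separated by tabs or commas and may be quoted.
    fields = [f.strip().strip('"') for f in line.replace(",", "\t").split("\t")]
    if len(fields) not in (4, 5) or not fields[2].isdigit():
        return None  # header row or a layout this sketch does not handle
    rsid, chromosome, position = fields[0], fields[1], int(fields[2])
    if len(fields) == 4:
        # Four column template: both values share one column, e.g. "AG".
        genotype = fields[3]
        if not genotype:
            return None
        allele1, allele2 = genotype[0], genotype[-1]
    else:
        # Five column template: each value has its own column.
        allele1, allele2 = fields[3], fields[4]
    return SnpRecord(rsid, chromosome, position, allele1, allele2)


# Made-up rows in both templates parse to the same record.
four_col = parse_raw_line("rs0000001\t1\t12345\tAG")
five_col = parse_raw_line('"rs0000001","1","12345","A","G"')
```

Both calls return the same `SnpRecord`, so the rest of an analysis does not need to care which template a company used.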

Nitrogenous Bases (ATCG)

Chromosomes look like long, twisted ladders. The longest chromosome (1) has about 249 million rungs and the smallest (21) has about 48 million. In total there are over 3 billion of these rungs in human DNA. Each rung in the ladder contains a pair of nitrogenous bases drawn from four possibilities: Adenine (A), Thymine (T), Cytosine (C), and Guanine (G). A and T are always paired together, and C and G are always paired together. Although two bases will always be together at each spot, only one of them needs to be recorded, because the base on one side of the rung determines the base on the other. By convention, one side of the ladder is called the + strand and the other side the - strand. Sometimes in an A T pair the A will be on the + strand and the T on the - strand, other times it will be the reverse, and the same is true for C and G pairs. For simplicity, DNA companies will therefore generally just record the value of a person's + strand at each spot they test.
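Because A always pairs with T and C always pairs with G, the value on one strand can be worked out from the value on the other. A minimal Python sketch of that lookup follows; the names `COMPLEMENT` and `opposite_strand` are hypothetical and not taken from any company's tools.

```python
# Pairing rules from the text: A pairs with T, and C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def opposite_strand(bases: str) -> str:
    """Return the bases found on the other strand at the same spots."""
    return "".join(COMPLEMENT[base] for base in bases)


# A genotype recorded as A G on one strand reads T C on the other.
flipped = opposite_strand("AG")
```

Applying the function twice returns the original letters, which is why recording a single strand loses no information.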

Single-Nucleotide Polymorphisms (SNPs)

Of the roughly 3.2 billion base pairs we all have, the overwhelming majority (about 99.9% between any two people) are identical. At those positions everyone carries the same one of the four nitrogenous bases (A, C, G, or T), so they tell us nothing about how two people are related. They simply make us human. The places where a variation can occur are called SNPs, which stands for single-nucleotide polymorphisms. SNPs are the main force behind genetic genealogy and are what give DNA its genealogical value. When two individuals have enough matching SNPs in a row, this becomes a matching segment. The more matching SNPs there are, the bigger the segment is. If a segment is big enough (more than 15 centimorgans, or cM), then the segment is almost certainly identical by descent (IBD), which means the two individuals share that segment because they both descend from a common ancestor who passed on that segment of DNA to both of them. The more matching segments there are and the bigger they are, the more closely two test takers are probably related. By testing a sample of a person's SNPs and then comparing them to everyone else in the database, it is possible to identify a person's genetic relatives. Most major companies test roughly 500,000 to 700,000 SNPs.

In practice, only two alleles are found at any one SNP the vast majority of the time. The more common allele is called the major allele and the less common one is called the minor allele. In autosomal DNA, each person will have two alleles at each spot, one inherited from their mother and one inherited from their father. This means that at each SNP tested a person can have one of three combinations:

1. Two copies of the major allele (homozygous)
2. Two copies of the minor allele (homozygous)
3. One copy of the major allele and one copy of the minor allele (heterozygous)

When two people have at least one matching allele at a SNP, it is a half match; if both alleles match, it is a full match; and if neither allele matches, it is a non-match. Since there are only three possible combinations at each spot, most people will be either a full or half match at any one SNP by coincidence even though they are not related. In fact, somebody who is heterozygous will be at least a half match with everybody on earth at that spot. This is why it is important that hundreds or thousands of SNPs in a row match before concluding that a matching segment is truly identical by descent.
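The match rules above can be sketched in Python. This is an illustration only, assuming each genotype is a two-letter string such as "AG" and that the two people's lists cover the same SNPs in the same order. The function names are hypothetical, and real matching tools also account for genetic distance, no calls, and testing errors.

```python
def classify_match(genotype_a: str, genotype_b: str) -> str:
    """Return 'full', 'half', or 'none' for two unphased two-letter genotypes."""
    # The order of the letters carries no meaning, so compare them sorted.
    if sorted(genotype_a) == sorted(genotype_b):
        return "full"
    # Any allele in common counts as a half match.
    if set(genotype_a) & set(genotype_b):
        return "half"
    return "none"


def longest_matching_run(person_a, person_b) -> int:
    """Count the longest run of consecutive SNPs that are at least a half match."""
    longest = current = 0
    for a, b in zip(person_a, person_b):
        if classify_match(a, b) == "none":
            current = 0  # a non-match breaks the run
        else:
            current += 1
            longest = max(longest, current)
    return longest


# A heterozygous A G is never a non-match at a SNP whose alleles are A and G.
hetero_results = [classify_match("AG", other) for other in ("AA", "AG", "GG")]
run_length = longest_matching_run(["AA", "AG", "GG", "AA"], ["AG", "AA", "AA", "AA"])
```

In the example, the third SNP (G G against A A) is the only non-match, so the longest run is the two SNPs before it. Short runs like this occur by chance all the time, which is the reason long runs are required.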

Phasing

Your raw data file lists your SNPs in the following order:

1. The leftmost SNP on chromosome 1 is listed first.
2. The next SNP listed is whatever SNP comes next moving from left to right.
3. Once the last SNP (the rightmost SNP) is read, the process repeats with the next chromosome.

At each SNP, however, the alleles are not listed in any particular order. AG, for example, means exactly the same thing as GA. It would be nice if the first allele was always from the father and the second always from the mother, or vice versa, but unfortunately it is a random mixture. When dealing with a heterozygous SNP, the best way to know which allele is from which parent is to compare against other relatives. Sorting out the paternal and maternal alleles is called phasing. For example, if you compare your DNA against your mom's and, at a spot where you have C G, your mom has G G, then you must have inherited the G from your mom and the C from your dad. If you are C G and your mom is also C G, then it is still unclear which allele came from which parent, and comparing against your dad or another relative would be necessary to figure it out.

Programs such as GedMatch.com offer the ability to phase your DNA by comparing it against one or both parents. Using phased kits reduces the number of false segments identified between you and a match and is a valuable tool for people interested in small DNA segments. However, in cases where you and the parent being compared against are both heterozygous (like C G), the value becomes a no call and is discarded from the comparison. For this reason, comparing your DNA against both parents produces better results than comparing against just one. Perhaps in the spot where you and your mom are both C G, your father is C C; now it can be concluded that you inherited the C from your father and the G from your mother.
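The phasing reasoning above can be sketched in Python for a single SNP. This is an illustration under stated assumptions, not how GedMatch.com or any other program actually implements phasing: genotypes are two-letter strings, the function name `phase_snp` is hypothetical, and a result of `None` stands for a no call.

```python
from typing import Optional, Tuple


def phase_snp(
    child: str,
    mother: Optional[str] = None,
    father: Optional[str] = None,
) -> Optional[Tuple[str, str]]:
    """Return (maternal_allele, paternal_allele), or None when it cannot be decided."""
    first, second = child[0], child[1]
    if first == second:
        # Homozygous: both parents passed on the same allele.
        return (first, second)
    # Heterozygous: try both assignments and keep those the parents allow.
    candidates = []
    for maternal, paternal in ((first, second), (second, first)):
        if mother is not None and maternal not in mother:
            continue  # mom does not carry this allele
        if father is not None and paternal not in father:
            continue  # dad does not carry this allele
        candidates.append((maternal, paternal))
    # Exactly one surviving assignment means the SNP is phased.
    return candidates[0] if len(candidates) == 1 else None


mom_only = phase_snp("CG", mother="GG")                    # G from mom, C from dad
ambiguous = phase_snp("CG", mother="CG")                   # no call
both_parents = phase_snp("CG", mother="CG", father="CC")   # resolved by dad
```

The three calls mirror the examples in the text: a homozygous mother settles the question, a heterozygous mother alone leaves a no call, and adding the father's C C resolves it.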