Hubert Life: 2019

Sunday, 31 March 2019

Phi X 174

What is it?

A single-stranded DNA (ssDNA) virus that infects Escherichia coli
The first DNA-based genome to be sequenced (in 1977)
Well-defined, small (5,386 bp), and diverse (45% GC, 55% AT) genome
fasta file download link:
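
The genome size and GC content quoted above can be checked directly from a FASTA file. The snippet below is a minimal sketch, assuming a single-record FASTA; the toy record is hypothetical and is not the real PhiX sequence.

```python
def read_fasta_sequence(text):
    """Concatenate the sequence lines of a single-record FASTA string."""
    lines = [line.strip() for line in text.splitlines()]
    return "".join(l for l in lines if l and not l.startswith(">")).upper()


def gc_fraction(seq):
    """Return the fraction of G and C bases in seq."""
    if not seq:
        raise ValueError("empty sequence")
    return sum(1 for base in seq if base in "GC") / len(seq)


# Hypothetical toy record, used only to show the calculation.
toy_fasta = ">toy_record\nGATTACAG\nGCGCATAT\n"
seq = read_fasta_sequence(toy_fasta)
print(len(seq), "bp, GC fraction:", round(gc_fraction(seq), 3))
```

For the real genome, read the downloaded FASTA file into a string and pass it to `read_fasta_sequence` in the same way.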

Using it as a positive control in Illumina NGS

What are benefits of using PhiX control?

Calibration control: PhiX can be run alone and serves as a calibration control for:

Cluster generation: can be used as a positive control in the clustering process

Platform	Mode/Reagents	Optimal Raw Cluster Density
HiSeq	High Output, TruSeq v3	750-850 K/mm²
HiSeq	High Output, HiSeq v4 (required upgrade)	950-1,050 K/mm²
HiSeq	Rapid v2	850-1,000 K/mm²
MiSeq	v2	1,000-1,200 K/mm²
MiSeq	v3	1,200-1,400 K/mm²
MiniSeq	Mid and High Output	170-220 K/mm²
NextSeq	Mid and High Output, v2	170-220 K/mm²

[table 1] Cluster density guidelines for Illumina sequencing platforms
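
As a small illustration of how table 1 might be used, the sketch below checks an observed raw cluster density against the listed range. The function name and the labels it returns are assumptions made for this example; only the numeric ranges come from the table.

```python
# Optimal raw cluster density ranges (K/mm^2), copied from table 1.
OPTIMAL_DENSITY = {
    ("HiSeq", "High Output, TruSeq v3"): (750, 850),
    ("HiSeq", "High Output, HiSeq v4"): (950, 1050),
    ("HiSeq", "Rapid v2"): (850, 1000),
    ("MiSeq", "v2"): (1000, 1200),
    ("MiSeq", "v3"): (1200, 1400),
    ("MiniSeq", "Mid and High Output"): (170, 220),
    ("NextSeq", "Mid and High Output, v2"): (170, 220),
}


def classify_density(platform, mode, density):
    """Compare an observed density (K/mm^2) with the table 1 range."""
    low, high = OPTIMAL_DENSITY[(platform, mode)]
    if density < low:
        return "below optimal range"
    if density > high:
        return "above optimal range"
    return "within optimal range"


print(classify_density("MiSeq", "v3", 1300))
```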

Cross talk matrix generation

During an Illumina sequencing run, the cross-talk caused by spectral overlap between the four fluorescently labeled nucleotides is calculated during template generation in cycles 1-5
https://www.slideshare.net/idtdna/unique-dualmatched-adapters-mitigate-index-hopping-between-ngs-samples

Phasing and Prephasing

During sequencing by synthesis, each DNA strand in a cluster extends by one base per cycle
A small proportion of strands may fall out of phase with the current cycle, either falling a base behind (phasing) or jumping a base ahead (prephasing)
For best results, use a PhiX spike-in as a control with any library that does not have a balanced base composition
High-GC samples (≥ 60%) typically show higher phasing rates, and in this case a PhiX control is required
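
To get a feel for why even small phasing and prephasing rates matter over a long read, here is a deliberately simplified sketch. It assumes that in every cycle each strand independently falls behind with probability `phasing` or jumps ahead with probability `prephasing`, and that a strand never returns to phase. The rates used are hypothetical, not values from an instrument.

```python
def in_phase_fraction(phasing, prephasing, cycles):
    """Fraction of strands still in phase after `cycles` cycles
    under the simplified independent-per-cycle model."""
    per_cycle = 1.0 - phasing - prephasing
    return per_cycle ** cycles


# Hypothetical rates: 0.2% phasing and 0.1% prephasing per cycle.
for cycles in (25, 75, 150):
    frac = in_phase_fraction(0.002, 0.001, cycles)
    print(cycles, "cycles:", round(frac, 3))
```

Under this model the in-phase fraction decays geometrically, so a rate that looks negligible per cycle still removes a large share of the signal by the end of a 150-cycle read.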

Run quality monitor: because of its small size and balanced nucleotide composition, PhiX is an ideal in-run control (typically with a spike-in of >= 1%) for run quality monitoring

Platform	PhiX Aligned (%)
iSeq 100	minimum 5%
MiniSeq	10-50%
MiSeq (MCS 2.2 or higher)	minimum 5%
NextSeq	10-50%
HiSeq 2500 (HCS 2.2.38 or higher)	minimum 10%
HiSeq 3000/4000 (HCS 3.3.76 or lower)	10-50%
HiSeq 3000/4000 (HCS 3.4.0 or higher)	5-20%
NovaSeq	minimum 10%

[table 2] PhiX Control v3 spike-in percentages that Illumina recommends when running low-diversity libraries

Color balancing

For low-diversity libraries, the PhiX Control v3 library provides balanced fluorescent signals at each cycle, which improves the overall run quality
You can find why nucleotide diversity is important here

How to remove PhiX reads from the fastq
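
The post does not show a procedure under this heading. One possible approach, sketched here under assumptions, is to align the reads to the PhiX genome with an aligner of your choice, collect the names of the reads that align, and then drop those records from the FASTQ. The function name `filter_fastq` and the toy reads are hypothetical; the sketch assumes standard 4-line FASTQ records.

```python
import io


def filter_fastq(fastq_handle, phix_names, out_handle):
    """Copy 4-line FASTQ records whose read name is not in phix_names.

    Returns the number of records written.
    """
    kept = 0
    while True:
        header = fastq_handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # skip stray blank lines
        seq = fastq_handle.readline()
        plus = fastq_handle.readline()
        qual = fastq_handle.readline()
        # Read name = text after '@' up to the first whitespace.
        name = header[1:].split()[0]
        if name not in phix_names:
            out_handle.write(header + seq + plus + qual)
            kept += 1
    return kept


# Toy input: pretend read2 was reported as PhiX by the aligner.
fastq_text = "@read1 1:N:0\nACGT\n+\nIIII\n@read2 1:N:0\nGATT\n+\nIIII\n"
out = io.StringIO()
kept = filter_fastq(io.StringIO(fastq_text), {"read2"}, out)
print(kept, "record(s) kept")
```

For paired-end data the same set of names would need to be applied to both the R1 and R2 files so the pairs stay in sync.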

Nucleotide Diversity

What is nucleotide diversity?

High nucleotide diversity: a library has roughly equal proportions of all four nucleotides in every cycle of the run
The diagram below illustrates the diversity and base balance of well-balanced and unbalanced libraries, and how they are reflected in the % base plot of Sequencing Analysis Viewer (SAV)

[fig 1] Illustration of diversity and base balance

Why is nucleotide diversity important?

Nucleotide diversity is required for effective template generation and is important for generating high-quality data
Diversity is especially important during the first 4-7 cycles of the first sequencing read for MiniSeq, MiSeq, NextSeq, and HiSeq 1000-2500 systems. The sequencing software uses images from these early cycles to identify the location of each cluster in a process called template generation
Diversity is also important for the first 25 cycles because this is when phasing/prephasing, color matrix corrections, and the pass filter calculations occur
Real-Time Analysis (RTA) software needs an adequate PhiX spike-in when the library has low diversity. You can find more specific data here
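
The per-cycle base balance described above can be estimated from the reads themselves. The following is a minimal sketch that tabulates the fraction of A, C, G, and T at each cycle; the function name and the two toy read sets are hypothetical and only illustrate the contrast between a balanced and a low-diversity library.

```python
from collections import Counter


def per_cycle_base_fractions(reads, cycles):
    """Return one dict per cycle with the fraction of A, C, G and T."""
    fractions = []
    for i in range(cycles):
        counts = Counter(read[i] for read in reads if len(read) > i)
        total = sum(counts[base] for base in "ACGT")
        fractions.append(
            {base: (counts[base] / total if total else 0.0) for base in "ACGT"}
        )
    return fractions


# Toy data: a balanced library versus an amplicon-like library.
balanced = ["ACGT", "CGTA", "GTAC", "TACG"]
amplicon = ["ACGT", "ACGT", "ACGA", "ACGT"]

print(per_cycle_base_fractions(balanced, 1)[0])
print(per_cycle_base_fractions(amplicon, 1)[0])
```

In the balanced set every base is close to 25% at the first cycle, while in the amplicon-like set a single base dominates, which is the situation where a PhiX spike-in helps.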

ref)

https://support.illumina.com/bulletins/2016/07/what-is-nucleotide-diversity-and-why-is-it-important.html

Estimating standard deviation: divide by n-1

Two reasons for that

To reduce the gap between the sample variance and the population variance
( empirical reason )

The "1/n" version is the maximum likelihood estimate of the population variance (under a normal model); however, it is mathematically biased
Computed with "1/n", the sample variance is on average smaller than the population variance
→ the population variance is underestimated
The "1/(n-1)" convention closes this gap ( it provides an unbiased estimate )
Why not n-2?

Because the number of degrees of freedom is n-1: one degree of freedom is used up by estimating the mean from the same sample

To make the expected value of the sample variance equal to the population variance
( mathematical reason )

let,
$n$ : sample size
$\bar{X}$ : sample mean
$s^2$ : sample variance
$m$ : population mean
$\sigma^2$ : population variance
then, we want to show that $E[s^2] = \sigma^2$, where
$E[s^2] = \frac{1}{n-1} E[\sum\limits_{k=1}^n (X_k-\bar{X})^2]$
first, using $\sum\limits_{k=1}^n (X_k-m) = n(\bar{X}-m)$,
$\sum\limits_{k=1}^n (X_k-\bar{X})^2 = \sum\limits_{k=1}^n ((X_k-m) + (m-\bar{X}))^2$
$\thickspace=\sum\limits_{k=1}^n ((X_k-m)^2 + 2(X_k-m)(m-\bar{X}) + (m-\bar{X})^2)$
$\thickspace=\sum\limits_{k=1}^n (X_k-m)^2 + 2n(\bar{X}-m)(m-\bar{X}) + n(m-\bar{X})^2$
$\thickspace=\sum\limits_{k=1}^n (X_k-m)^2 - 2n(\bar{X}-m)^2 + n(\bar{X}-m)^2$
$\thickspace=\sum\limits_{k=1}^n (X_k-m)^2 - n(\bar{X}-m)^2$
$\therefore E[s^2] = \frac{1}{n-1} E[\sum\limits_{k=1}^n (X_k-\bar{X})^2]$
$\thickspace= \frac{1}{n-1} E[\sum\limits_{k=1}^n (X_k-m)^2 - n(\bar{X}-m)^2]$
as here,
$E[(X_k-m)^2] = \sigma^2$
$E[(\bar{X}-m)^2] = V(\bar{X}) = \sigma^2/n$ (for independent samples)
$\frac{1}{n-1} E[\sum\limits_{k=1}^n (X_k-m)^2 - n(\bar{X}-m)^2] = \frac{1}{n-1} \cdot (n\sigma^2-n(\sigma^2/n)) = \sigma^2$
$\therefore s^2 = \frac{1}{n-1} \sum\limits_{k=1}^n (X_k-\bar{X})^2$
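
The result can also be checked numerically. The sketch below is a small simulation, assuming samples drawn from a standard normal distribution (so the population variance is 1); the sample size, number of trials, and seed are arbitrary choices for illustration.

```python
import random


def variance(values, ddof):
    """Sum of squared deviations from the sample mean, divided by n - ddof."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - ddof)


def average_estimates(n, trials, seed=0):
    """Average the 1/n and 1/(n-1) variance estimates over many samples."""
    rng = random.Random(seed)
    biased_total = 0.0
    unbiased_total = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        biased_total += variance(sample, ddof=0)    # divide by n
        unbiased_total += variance(sample, ddof=1)  # divide by n - 1
    return biased_total / trials, unbiased_total / trials


biased_mean, unbiased_mean = average_estimates(n=5, trials=20000)
print("1/n average:    ", round(biased_mean, 3))
print("1/(n-1) average:", round(unbiased_mean, 3))
```

With n = 5, the "1/n" average should land near (n-1)/n = 0.8 of the true variance, while the "1/(n-1)" average should land near 1, in line with the derivation.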
