ISO/IEC 23092-2:2020 Information technology - Genomic information representation - Part 2: Coding of genomic information

3 Terms and definitions

For the purposes of this document, the terms and definitions given in ISO/IEC 23092-1 and the following apply.

ISO and IEC maintain terminological databases for use in standardization at the following addresses:

3.1

alignment

information describing the similarity between a sequence [typically a sequencing read (3.28)] and a reference sequence (for instance, a reference genome)

Note 1 to entry: An alignment is described in terms of a position within the reference, the strand of the reference, and a set of edit operations (matches, mismatches, insertions and deletions, clipping of the sequence ends and splicing information) needed to turn the first sequence into the second.

3.2

CIGAR string

CIGAR

textual way of representing an alignment (3.1)

Note 1 to entry: Several definitions have been used by different programs; the one referred to here is the one used in the SAM format. It encodes a set of edit operations (matches, mismatches, insertions and deletions, clipping of the sequence ends and splicing information) needed to turn the sequencing read into the reference.
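As an illustration of the note above, the following is a minimal sketch of how a SAM-style CIGAR string could be split into its edit operations. The operation letters (M, I, D, N, S, H, P, =, X) and the "*" placeholder come from the SAM format, not from this document, and the function name parse_cigar is hypothetical.

```python
import re

# One CIGAR token: a run length followed by a SAM operation letter.
_CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar: str) -> list[tuple[int, str]]:
    """Split a SAM CIGAR string into (length, operation) pairs."""
    if cigar == "*":  # SAM writes "*" when no alignment is available
        return []
    ops = []
    pos = 0
    for match in _CIGAR_TOKEN.finditer(cigar):
        if match.start() != pos:  # unparsable characters between tokens
            raise ValueError(f"invalid CIGAR string: {cigar!r}")
        ops.append((int(match.group(1)), match.group(2)))
        pos = match.end()
    if pos != len(cigar) or not ops:  # trailing garbage or empty input
        raise ValueError(f"invalid CIGAR string: {cigar!r}")
    return ops


if __name__ == "__main__":
    # 3 soft-clipped bases, 10 aligned, 2 inserted, 5 aligned
    print(parse_cigar("3S10M2I5M"))
```

The sketch only tokenizes the string; it does not check SAM's ordering rules (for example, that clips appear only at the ends).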

3.3

dataset

compression unit containing one or more of: reference sequences; sequencing reads (3.28); and alignment (3.1) information

Note 1 to entry: Datasets shall be as specified in ISO/IEC 23092-1.

3.4

deletion

contiguous removal of one or more bases from a genomic sequence

3.5

E-CIGAR

extended CIGAR syntax specified as a superset of the CIGAR syntax

Note 1 to entry: Among other things, E-CIGAR enables the unambiguous representation of substitutions, spliced reads and splice strandedness.

3.6

edit operation

modification of a sequence of nucleotides (3.20) by means of a substitution, deletion (3.4), insertion (3.18) or clip

3.7

FASTA

GIR that includes a name and a nucleotide (3.20) sequence for each sequencing read (3.28)

Note 1 to entry: Additional information is usually encoded in the read identifier by bioinformatics tools (such as database information, and base calling information).

3.8

FASTQ

GIR that includes FASTA (3.7) and quality values (3.22)
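To make the relationship between FASTA and FASTQ concrete, here is a minimal sketch of a reader for FASTQ text. It assumes the common four-line layout (an "@" header line, the sequence, a "+" separator, the quality string), which is a convention of the FASTQ format and is not specified in this document. The names FastqRecord and parse_fastq are hypothetical.

```python
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str  # read identifier (3.24), without the leading "@"
    sequence: str    # nucleotide sequence, as in FASTA
    qualities: str   # one quality character per nucleotide


def parse_fastq(text: str) -> Iterator[FastqRecord]:
    """Yield records from FASTQ text laid out as four lines per read."""
    lines = text.strip().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch at line {i + 1}")
        yield FastqRecord(header[1:], seq, qual)


if __name__ == "__main__":
    for record in parse_fastq("@read1 lane=3\nACGT\n+\nIIII\n"):
        print(record)
```

Dropping the quality string from each record leaves exactly the name and sequence that the FASTA definition (3.7) describes.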

3.9

first end

end 1

first segment of a paired-end template (3.33)

Note 1 to entry: Illumina platforms usually store first and second ends in two separate files and in the same order — i.e. the n-th read of the first FASTQ file and the n-th read of the second FASTQ file belong to the same template.
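The ordering convention in the note above can be sketched as follows: the n-th entry of the first file is paired with the n-th entry of the second. The function name pair_ends is hypothetical, and treating a length mismatch as an error is an assumption of this sketch rather than a requirement stated here.

```python
from itertools import zip_longest


def pair_ends(first_ends, second_ends):
    """Pair the n-th read of each input into one (end 1, end 2) template."""
    missing = object()  # sentinel marking an exhausted input
    pairs = []
    zipped = zip_longest(first_ends, second_ends, fillvalue=missing)
    for n, (end1, end2) in enumerate(zipped):
        if end1 is missing or end2 is missing:
            raise ValueError(f"inputs differ in length at read {n}")
        pairs.append((end1, end2))
    return pairs


if __name__ == "__main__":
    print(pair_ends(["a/1", "b/1"], ["a/2", "b/2"]))
```

Any sequence of read objects works as input, for instance the records read from two FASTQ files.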

3.10

genomic descriptor

descriptor

element of the syntax used to represent a feature of a genomic sequencing read (3.28) or associated information such as alignment (3.1) information or quality values (3.22)

3.11

genomic information representation

way to describe a sequence and some information associated with it

Note 1 to entry: Which information is represented varies depending on the GIR.

3.12

genomic record

record

data structure representing a tuple (3.34) optionally associated with alignment (3.1) information, read identifier (3.24) and quality values (3.22)

3.13

genomic record index

position of a genomic record in the sequence of genomic records (3.12) encoded in an access unit

3.14

genomic record position

0-based position of the leftmost mapped base on the reference genome of the first alignment (3.1) contained in a genomic record (3.12)

Note 1 to entry: A base present in the aligned read and not present in the reference sequence (insertion) and bases preserved by the alignment process but not mapped on the reference sequence (soft clips) do not have mapping positions.
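The following sketch illustrates the note above for a single alignment described in SAM terms. It assumes SAM's 1-based POS field and SAM CIGAR operation letters, neither of which is defined in this document; the function names are hypothetical. Inserted and soft-clipped bases receive no mapping position, and the genomic record position is the 0-based position of the first mapped base.

```python
import re


def base_positions(sam_pos: int, cigar: str) -> list:
    """For each base stored in the read, give its 0-based reference
    position, or None when the base has no mapping position."""
    ref = sam_pos - 1  # SAM POS is 1-based; 3.14 uses 0-based positions
    positions = []
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        n = int(length)
        if op in "M=X":      # aligned bases: consume read and reference
            positions.extend(range(ref, ref + n))
            ref += n
        elif op in "IS":     # insertion or soft clip: read only
            positions.extend([None] * n)
        elif op in "DN":     # deletion or skipped region: reference only
            ref += n
        # H (hard clip) and P (padding) consume neither read nor reference
    return positions


def record_position(sam_pos: int, cigar: str):
    """0-based position of the leftmost mapped base, or None if unmapped."""
    mapped = [p for p in base_positions(sam_pos, cigar) if p is not None]
    return mapped[0] if mapped else None


if __name__ == "__main__":
    print(base_positions(101, "2S3M1I2M"))
    print(record_position(101, "2S3M1I2M"))
```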

3.15

genomic reference

reference

collection of reference sequences

Note 1 to entry: Typical examples are a reference genome or a reference transcriptome.

3.16

hard clip

base or set of bases originally present at either side of a read, and removed from it following alignment (3.1)

Note 1 to entry: The bases are no longer present in the sequence of the read.

3.17

indel

contiguous stretch of nucleotides (3.20) that, when aligning two sequences, are inserted into one sequence, or alternatively deleted from the other, in order to make the two sequences the same

Note 1 to entry: From “insertion or deletion”.

3.18

insertion

contiguous addition of one or more bases into a genomic sequence

3.19

leftmost read end

leftmost read

sequencing read (3.28) generated by a paired-end sequencing run and mapped at a position on the reference sequence which is smaller than the mapping position of the other read in the pair

3.20

nucleotide

base

base pair

monomer of a nucleic acid polymer such as DNA or RNA

Note 1 to entry: Nucleotides are denoted as letters ('A' for adenine; 'C' for cytosine; 'G' for guanine; 'T' for thymine, which only occurs in DNA; and 'U' for uracil, which only occurs in RNA). The chemical formula for a specific DNA or RNA molecule is given by the sequence of its nucleotides, which can be represented as a string over the alphabet ('A', 'C', 'G', 'T') in the case of DNA, and a string over the alphabet ('A', 'C', 'G', 'U') in the case of RNA. Bases with unknown molecular composition are denoted with 'N'.
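As a small illustration of the alphabets in the note above, this sketch classifies a string as DNA, RNA, or neither. The function name and the "ambiguous" result for strings containing neither 'T' nor 'U' are choices made for this example; only uppercase letters are accepted, which is also an assumption.

```python
DNA_ALPHABET = frozenset("ACGT")
RNA_ALPHABET = frozenset("ACGU")
UNKNOWN_BASE = "N"  # base with unknown molecular composition


def classify_sequence(seq: str) -> str:
    """Return 'DNA', 'RNA', 'ambiguous' or 'invalid' for a nucleotide string."""
    if not seq:
        raise ValueError("empty sequence")
    symbols = set(seq) - {UNKNOWN_BASE}
    fits_dna = symbols <= DNA_ALPHABET
    fits_rna = symbols <= RNA_ALPHABET
    if fits_dna and fits_rna:
        return "ambiguous"  # only A, C, G (and N): valid in both alphabets
    if fits_dna:
        return "DNA"
    if fits_rna:
        return "RNA"
    return "invalid"


if __name__ == "__main__":
    for s in ("ACGTN", "ACGU", "ACGN", "ACGTU"):
        print(s, classify_sequence(s))
```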

3.21

paired-end read

paired-end template

tuple (3.34) made of two segments

Note 1 to entry: Typically the segments correspond to the beginning and the end of the same nucleic acid molecule.

3.22

quality value

quality score

number assigned to each nucleotide (3.20) base call in automated sequencing processes

Note 1 to entry: Quality values express the base-call accuracy, i.e. the probability (or a related measure) for a nucleotide in the sequence to have been incorrectly determined.
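One widely used "related measure" is the Phred scale, Q = -10 log10(p), where p is the probability that the base call is wrong. The sketch below assumes that scale and the common FASTQ encoding with an ASCII offset of 33; neither is mandated by the definition above, and the function names are hypothetical.

```python
import math


def error_probability(quality: float) -> float:
    """Phred quality value -> probability that the base call is wrong."""
    return 10 ** (-quality / 10)


def phred_quality(error_prob: float) -> float:
    """Probability that the base call is wrong -> Phred quality value."""
    if not 0 < error_prob <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(error_prob)


def decode_quality_string(qualities: str, offset: int = 33) -> list[int]:
    """Decode FASTQ quality characters, assuming the given ASCII offset."""
    return [ord(ch) - offset for ch in qualities]


if __name__ == "__main__":
    print(decode_quality_string("I!5"))
    print(error_probability(30))
```

Under this scale a quality value of 30 corresponds to roughly one wrong call in a thousand.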

3.23

read group

set of reads having some property in common

3.24

read identifier

read header

read name

text string associated with each sequencing read (3.28) stored in GIRs such as FASTA (3.7), FASTQ (3.8) and SAM (3.26)

Note 1 to entry: The read identifier is usually unique within its dataset, and may contain additional information as encoded by bioinformatics tools (such as database information, and base calling information).

3.25

rightmost read end

rightmost read

sequencing read (3.28) generated by a paired-end sequencing run and mapped at a position on the reference sequence which is greater than the mapping position of the other read in the pair
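The leftmost and rightmost definitions (3.19 and 3.25) amount to a comparison of two mapping positions, sketched below. The function name is hypothetical, and raising an error when both reads map to the same position is an assumption of this sketch, since the definitions above do not address that case.

```python
def label_read_ends(pos_a: int, pos_b: int) -> tuple[str, str]:
    """Label two reads of a pair by comparing their mapping positions."""
    if pos_a == pos_b:
        # Not covered by the definitions above; handled here as an error.
        raise ValueError("equal mapping positions: no leftmost/rightmost read")
    if pos_a < pos_b:
        return ("leftmost", "rightmost")
    return ("rightmost", "leftmost")


if __name__ == "__main__":
    print(label_read_ends(1500, 1200))
```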

3.26

SAM

GIR that is human readable and includes FASTQ plus alignment (3.1) and analysis information

Note 1 to entry: From “Sequence Alignment/Map format”. SAM originates from the 1000 Genome Sequencing Project. It is represented in plain ASCII, extensible by users and includes sequence, quality, alignment and analysis information.

3.27

second end

second segment of a paired-end template (3.33)

Note 1 to entry: Sequencing platforms usually store first and second ends in two separate files and in the same order — i.e. the n-th read of the first FASTQ file and the n-th read of the second FASTQ file belong to the same template.

3.28

sequencing read

read

readout, by a specific technology more or less prone to errors, of a continuous part of a segment of nucleotides (3.20) extracted from an organic sample

3.29

single-end read

tuple (3.34) made of one segment

3.30

soft clip

soft clipped bases

base or set of bases at either side of the read that have been ignored during the alignment (3.1) process

Note 1 to entry: The bases are still present in the sequence of the read.
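The practical difference between soft clips and hard clips (3.16) shows up when counting bases. The sketch below uses SAM CIGAR letters (S for soft clip, H for hard clip), which are an assumption taken from the SAM format; the function name is hypothetical. Soft-clipped bases count toward the stored read length, hard-clipped bases do not.

```python
import re


def clip_summary(cigar: str) -> dict:
    """Count stored, soft-clipped and hard-clipped bases in a SAM CIGAR."""
    ops = [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]
    # M, I, S, = and X consume bases that are stored in the read sequence.
    stored = sum(n for n, op in ops if op in "MIS=X")
    soft = sum(n for n, op in ops if op == "S")
    hard = sum(n for n, op in ops if op == "H")
    return {
        "stored_length": stored,            # bases still in the read sequence
        "soft_clipped": soft,               # ignored by alignment, but stored
        "hard_clipped": hard,               # removed from the read sequence
        "original_length": stored + hard,   # length before hard clipping
    }


if __name__ == "__main__":
    print(clip_summary("5H3S10M2S"))
```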

3.31

spliced read

aligned read which, as a consequence of biological splicing, covers non-contiguous portions of the reference genome

Note 1 to entry: This means the read must come from RNA-sequencing, and contain at least one junction between two consecutive exons.

3.32

split alignment

aligned paired-end read (3.21) whose ends are encoded in two different genomic records (3.12)

3.33

template

genomic sequence that is produced by a sequencing machine as a single unit

Note 1 to entry: A template can be made of one or more segments. It is called a single-end sequencing read when it has only one segment, and a paired-end sequencing read when it has two segments; typically the two segments capture the beginning and the end of a nucleic acid molecule.

3.34

tuple

collection of one or more segments

Note 1 to entry: Each segment can be: unmapped; mapped once; or mapped more than once.
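One possible way to model a tuple and the three mapping states from the note above is sketched here. The class and attribute names are hypothetical and do not reflect the data structures specified in ISO/IEC 23092; the mapping state is simply derived from how many mapping positions a segment carries.

```python
from dataclasses import dataclass, field
from enum import Enum


class MappingState(Enum):
    UNMAPPED = "unmapped"
    MAPPED_ONCE = "mapped once"
    MAPPED_MULTIPLE = "mapped more than once"


@dataclass
class Segment:
    sequence: str
    mapping_positions: list[int] = field(default_factory=list)

    @property
    def state(self) -> MappingState:
        """Derive the mapping state from the number of mappings."""
        if not self.mapping_positions:
            return MappingState.UNMAPPED
        if len(self.mapping_positions) == 1:
            return MappingState.MAPPED_ONCE
        return MappingState.MAPPED_MULTIPLE


@dataclass
class ReadTuple:
    segments: list[Segment]

    def __post_init__(self) -> None:
        if not self.segments:
            raise ValueError("a tuple holds one or more segments")

    @property
    def kind(self) -> str:
        """Name the tuple after its segment count (see 3.29 and 3.21)."""
        names = {1: "single-end read", 2: "paired-end read"}
        return names.get(len(self.segments), "tuple")


if __name__ == "__main__":
    pair = ReadTuple([Segment("ACGT", [100]), Segment("TTGA")])
    print(pair.kind, [s.state.value for s in pair.segments])
```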

3.35

decoded genomic descriptor

result of multiplexing the decoded symbols (3.37) of one or more descriptor subsequences (3.36)

3.36

descriptor subsequence

ordered collection of decoded symbols (3.37)

3.37

decoded symbol

value needed to reconstruct a descriptor subsequence (3.36)

Note 1 to entry: If no inverse subsequence transformation is applied, the transformed symbol shall be equal to the decoded symbol.

3.38

transformed subsequence

ordered collection of transformed symbols (3.39)

Note 1 to entry: The transformed symbols of one or more transformed subsequences can be multiplexed to yield decoded symbols.

3.39

transformed symbol

concatenation of one or more decoded subsymbols (3.40)

3.40

decoded subsymbol

output of an inverse subsymbol transformation applied on a transformed subsymbol (3.41)

Note 1 to entry: See subclause 12.6.2.7. If no inverse subsymbol transformation is applied, the decoded subsymbol shall be equal to the transformed subsymbol.

3.41

transformed subsymbol

decoded cabac subsymbol

atomic value yielded by the cabac decoding process
