ISO/IEC 23092-6:2023 Information technology

Introduction

ISO/IEC 23092-1 to ISO/IEC 23092-5 (MPEG-G) deal with the representation of genomic information derived from the primary analysis of high-throughput sequencing (HTS) data: sequencing reads and qualities, and their alignment to a reference genome. Primary analysis, however, is only the first step in a long series. In particular, its results are usually processed further in order to obtain higher-level information. The process of aggregating information deduced from single reads and their alignments to the genome into more complex results is generally known as secondary analysis. In most HTS-based biological studies, the output of secondary analysis is represented as different types of annotations associated with one or more genomic intervals on the reference sequences.

Biological studies typically produce genomic annotation data such as mapping statistics, quantitative browser tracks, variants, genome functional annotations, gene expression data and Hi-C contact matrices. These diverse types of downstream genomic data are currently represented in different formats, such as VCF, BED and WIG, with loosely defined semantics. This leads to interoperability issues, the need for frequent conversions between formats, difficulty in visualizing multi-modal data and complicated information exchange. Figure 1 depicts a typical pipeline for the primary and secondary analyses of HTS data, the file formats involved and the scopes of different parts of the ISO/IEC 23092 series.
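As an illustration of the conversions mentioned above, the following Python sketch maps a BED record and a VCF record onto one common interval type. It is not part of the ISO/IEC 23092 series; the `Interval` type and the helper names `from_bed` and `from_vcf` are hypothetical. It relies only on the published conventions that BED start positions are 0-based with an exclusive end, while the VCF POS field is 1-based.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """Hypothetical common representation: 0-based start, exclusive end."""
    chrom: str
    start: int
    end: int


def from_bed(line: str) -> Interval:
    # BED: chrom, chromStart (0-based), chromEnd (exclusive).
    chrom, start, end = line.split("\t")[:3]
    return Interval(chrom, int(start), int(end))


def from_vcf(line: str) -> Interval:
    # VCF: CHROM, POS (1-based), ID, REF, ... ; the REF allele spans len(REF) bases.
    fields = line.split("\t")
    chrom, pos, ref = fields[0], int(fields[1]), fields[3]
    return Interval(chrom, pos - 1, pos - 1 + len(ref))


# The same single base, written in each format's own convention.
bed_interval = from_bed("chr1\t999\t1000")
vcf_interval = from_vcf("chr1\t1000\t.\tA\tG\t.\t.\t.")
```

Both records describe the same base, yet their textual coordinates differ by one, which is the kind of discrepancy that format conversions have to reconcile.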

Furthermore, the lack of a single format has stifled work on compression algorithms and has led to the widespread use of general-purpose compression algorithms with suboptimal performance. These algorithms do not exploit the fact that annotation data typically comprise multiple fields (attributes) with different statistical characteristics, and instead compress them together. As a result, while these algorithms support efficient random access with respect to genomic position, they do not allow specific fields to be extracted without decompressing the whole file.
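The difference between compressing all fields together and compressing each field separately can be sketched as follows. This is a minimal illustration, not the coding scheme specified in this document: the record layout, the field names and the use of `zlib` as a stand-in general-purpose compressor are all assumptions.

```python
import zlib

# Hypothetical annotation records (chrom, start, end, score); values are illustrative.
FIELDS = ("chrom", "start", "end", "score")
records = [("chr1", 100 * i, 100 * i + 50, i % 7) for i in range(1000)]


def compress_rows(rows):
    """Compress all fields together as one text stream."""
    text = "\n".join("\t".join(str(v) for v in row) for row in rows)
    return zlib.compress(text.encode("utf-8"))


def compress_columns(rows, names):
    """Compress each field (attribute) as its own independent stream."""
    streams = {}
    for idx, name in enumerate(names):
        column = "\n".join(str(row[idx]) for row in rows)
        streams[name] = zlib.compress(column.encode("utf-8"))
    return streams


def read_column(streams, name):
    """Decompress one field only; the other streams are never touched."""
    return zlib.decompress(streams[name]).decode("utf-8").split("\n")


row_blob = compress_rows(records)                 # one blob: all or nothing
col_streams = compress_columns(records, FIELDS)   # one stream per field
scores = read_column(col_streams, "score")
```

With the row-wise blob, reading the scores requires decompressing every field; with per-field streams, only the `score` stream is decompressed. Separate streams also allow a coder to be matched to each field's statistics, although this sketch applies the same compressor to all of them.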

In response to the aforementioned challenges, this document details a unified data format for the efficient representation and compression of diverse genomic annotation data for file storage or data transport. The benefits are manifold: reducing the cost of data storage, improving the speed of random data access and processing, providing support for data security and privacy in selected genomic regions, and creating linkages across different types of genomic annotation and sequencing data. The ultimate goal is to enable the secure and seamless sharing, processing and analysis of multi-modal genomic data, reducing the burden of data manipulation and management so that scientists can focus on biological interpretation and discovery.

Figure 1 — Typical pipeline for the primary and secondary analyses of HTS data

Key

	sequencing generates raw reads
	read alignment
	variant calling
	variant annotation
	analysis
