ISO/IEC 23092-3:2022 Information technology — Genomic information representation — Part 3: Metadata and application programming interfaces (APIs)

Introduction

The advent of high-throughput sequencing (HTS) technologies has the potential to boost the adoption of genomic information in everyday practice, ranging from biological research to personalized genomic medicine in the clinic. As a consequence, the volume of generated data has increased dramatically during the last few years, and an even more pronounced growth is expected in the near future.

At the moment, genomic information is mostly exchanged through a variety of data formats, such as FASTA/FASTQ for unaligned sequencing reads and SAM/BAM/CRAM for aligned reads. With respect to such formats, the ISO/IEC 23092 series provides a new solution for the representation and compression of genome sequencing information by:

specifying an abstract representation of the sequencing data rather than a specific format with its direct implementation;
being designed at a time when technologies and use cases are more mature. This makes it possible to address one limitation of the textual SAM format, to which features were added incrementally and ad hoc over the years, leaving an overall redundant and suboptimal format that is at once not general and unnecessarily complicated;
normatively separating free-field user-defined information with no clear semantics from the normative genomic data representation. This allows a fully interoperable and automatic exchange of information between different data producers;
allowing multiplexing of relevant metadata information with the data since data and metadata are partitioned at different conceptual levels;
following a strict and supervised development process which has proven successful over the last 30 years in the domain of digital media for the transport format, the file format, the compressed representation and the application programming interfaces.

This document provides the enabling technology that will allow the community to create an ecosystem of novel, interoperable solutions in the field of genomic information processing. In particular, it offers:

consistent, general and properly designed format definitions and data structures to store sequencing and alignment information, forming a robust framework which can be used as a foundation to implement different compression algorithms;
speed and flexibility in the selective access to coded data, by means of newly designed data clustering and optimized storage methodologies;
low latency in data transmission and consequent fast availability at remote locations, based on transmission protocols inspired by real-time application domains;
built-in privacy and protection of sensitive information, thanks to a flexible framework which allows customizable secured access at all layers of the data hierarchy;
reliability of the technology and interoperability among tools and systems, owing to the provision of a normative procedure to assess conformance to the standard on an exhaustive dataset;
support for the implementation of a complete ecosystem of compliant devices and applications, through the availability of a normative reference implementation covering the totality of the specification.

The fundamental structure of the ISO/IEC 23092 series data representation is the genomic record. The genomic record is a data structure consisting of either a single sequence read, or a paired sequence read, and its associated sequencing and alignment information; it may contain detailed mapping and alignment data, a single or paired read identifier (read name) and quality values.

Without breaking traditional approaches, the genomic record introduced in the ISO/IEC 23092 series provides a more compact, simpler and manageable data structure grouping all the information related to a single DNA template, from simple sequencing data to sophisticated alignment information.
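As an informal illustration of the genomic record described above, the following Python sketch groups the reads of one template with their optional read name, quality values and alignment data. All class and field names (`GenomicRecord`, `Alignment`, `cigar` and so on) are hypothetical and chosen for readability; they are not the syntax elements defined by the ISO/IEC 23092 series.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Alignment:
    """Hypothetical mapping of one read onto a reference sequence."""
    reference_id: int
    position: int   # 0-based leftmost mapped position (an assumption)
    cigar: str      # SAM-like operation string, used here only for illustration


@dataclass
class GenomicRecord:
    """One DNA template: a single read or a read pair plus optional extras."""
    sequences: List[str]                       # one entry (single) or two (pair)
    read_name: Optional[str] = None
    quality_values: List[str] = field(default_factory=list)
    alignments: List[Alignment] = field(default_factory=list)

    def __post_init__(self) -> None:
        if len(self.sequences) not in (1, 2):
            raise ValueError("a genomic record holds a single read or a read pair")

    @property
    def is_paired(self) -> bool:
        return len(self.sequences) == 2

    @property
    def is_aligned(self) -> bool:
        return bool(self.alignments)


# Raw sequencing data: one read, a name and quality values, no alignment.
raw = GenomicRecord(sequences=["ACGTTGCA"], read_name="read_1",
                    quality_values=["IIIIHHGG"])

# An aligned read pair: both mates carry mapping information.
pair = GenomicRecord(
    sequences=["ACGT", "TTGA"],
    read_name="pair_7",
    alignments=[Alignment(0, 100, "4M"), Alignment(0, 250, "4M")],
)
```

The same structure covers both ends of the range mentioned in the text: `raw` carries only sequencing data, while `pair` adds alignment information for the same kind of record.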

The genomic record, although it is an appropriate logical data structure for interaction and manipulation of coded information, is not a suitable atomic data structure for compression. To achieve high compression ratios, it is necessary to group genomic records into clusters and to transform the information of the same type into sets of descriptors structured into homogeneous blocks. Furthermore, when dealing with selective data access, the genomic record is too small a unit to allow effective and fast information retrieval.

For these reasons, this document introduces the concept of access unit, which is the fundamental structure for coding and access to information in the compressed domain.

The access unit is the smallest data structure that can be decoded by a decoder compliant with ISO/IEC 23092-2. An access unit is composed of one block for each descriptor used to represent the information of its genomic records; therefore, a block payload is the coded representation of all the data of the same type (i.e. a descriptor) in a cluster.
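The relationship between a cluster of records, descriptors and blocks can be sketched in Python as follows. This is a toy model under stated assumptions: records are plain dictionaries whose keys stand in for descriptors, `zlib` stands in for whatever coding the standard actually specifies, and the function names `encode_access_unit` and `decode_access_unit` are hypothetical.

```python
import zlib
from typing import Dict, List

# Each record is a plain dict here; its keys play the role of descriptors.
Record = Dict[str, str]


def encode_access_unit(cluster: List[Record]) -> Dict[str, bytes]:
    """Turn a cluster of records into one compressed block per descriptor."""
    if not cluster:
        return {}
    names = sorted(cluster[0])
    if any(sorted(record) != names for record in cluster):
        raise ValueError("all records in a cluster must share the same descriptors")
    blocks = {}
    for name in names:
        # Gather every value of this descriptor across the cluster, so that
        # data of the same type ends up in one homogeneous block.
        # Values must not contain newlines in this simplified layout.
        column = "\n".join(record[name] for record in cluster)
        blocks[name] = zlib.compress(column.encode("ascii"))
    return blocks


def decode_access_unit(blocks: Dict[str, bytes]) -> List[Record]:
    """Rebuild the records of a cluster from its per-descriptor blocks."""
    if not blocks:
        return []
    columns = {
        name: zlib.decompress(payload).decode("ascii").split("\n")
        for name, payload in blocks.items()
    }
    count = len(next(iter(columns.values())))
    return [{name: values[i] for name, values in columns.items()}
            for i in range(count)]


cluster = [
    {"position": "100", "sequence": "ACGT", "quality": "IIII"},
    {"position": "104", "sequence": "ACGA", "quality": "IIHH"},
]
blocks = encode_access_unit(cluster)
restored = decode_access_unit(blocks)
```

Here `blocks` holds exactly one payload per descriptor (`position`, `quality`, `sequence`), and decoding the whole access unit returns the original cluster. A single record cannot be recovered without decoding the blocks it is spread across, which is why the access unit, not the record, is the smallest decodable unit.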

In addition to clusters of genomic records compressed into access units, reads are further classified into six data classes: five classes are defined according to the result of their alignment against one or more reference sequences; the sixth class contains either reads that could not be mapped or raw sequencing data. The classification of sequence reads into classes enables the development of powerful selective data access. In fact, access units inherit a specific data characterization (e.g. perfect matches in Class P, substitutions in Class M, indels in Class I, half-mapped reads in Class HM) from the genomic records composing them, and thus constitute a data structure capable of providing powerful filtering capability for the efficient support of many different use cases.
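The classification idea can be illustrated with a small Python sketch. It is a deliberately simplified classifier, not the normative procedure: the labels P, M, I and HM come from the text above, while the labels `N` and `U`, the function `classify` and its parameters are assumptions made for this example.

```python
from enum import Enum
from typing import Optional


class DataClass(Enum):
    P = "perfect match"
    N = "mismatches only at unknown bases"   # assumed label, not named in the text
    M = "substitutions"
    I = "insertions or deletions"
    HM = "half-mapped pair"
    U = "unmapped or raw"                    # assumed label, not named in the text


def classify(read: str, ref_segment: Optional[str], has_indel: bool = False,
             mate_mapped: bool = True) -> DataClass:
    """Toy classifier: compare a read with the reference segment it maps to."""
    if ref_segment is None:
        return DataClass.U            # no mapping at all
    if not mate_mapped:
        return DataClass.HM           # only one read of the pair is mapped
    if has_indel:
        return DataClass.I
    mismatches = [base for base, ref in zip(read, ref_segment) if base != ref]
    if not mismatches:
        return DataClass.P
    if all(base == "N" for base in mismatches):
        return DataClass.N
    return DataClass.M


reads = [
    ("ACGT", "ACGT", False, True),
    ("ACNT", "ACGT", False, True),
    ("ACTT", "ACGT", False, True),
    ("ACGT", "ACGGT", True, True),
    ("ACGT", "ACGT", False, False),
    ("ACGT", None, False, True),
]
labels = [classify(*read).name for read in reads]
print(labels)  # ['P', 'N', 'M', 'I', 'HM', 'U']
```

Because every record in an access unit shares one class, a reader interested only in, say, reads with substitutions can skip every access unit whose class is not M without decoding it.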

Access units are the fundamental, finest-grained data structure in terms of content protection and in terms of metadata association. In other words, each access unit can be protected individually and independently. Figure 1 shows how access units, blocks and genomic records relate to each other in the ISO/IEC 23092 series data structure.

Figure 1—Access units, blocks and genomic records

Figure 2—High-level data structure: datasets and dataset group

A dataset is a coded data structure containing headers and one or more access units. Typical datasets could, for example, contain the complete sequencing of an individual, or a portion of it. Other datasets could contain, for example, a reference genome or a subset of its chromosomes. Datasets are grouped in dataset groups, as shown in Figure 2.
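The containment hierarchy described above (dataset group, dataset, access unit) can be sketched in Python as nested containers. The class names, the `header` dictionary and the `select` helper are hypothetical and serve only to show how class-based selective access might traverse the hierarchy.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class AccessUnit:
    data_class: str   # e.g. "P", "M", "I", "HM"
    payload: bytes    # opaque coded blocks; never decoded in this sketch


@dataclass
class Dataset:
    header: dict
    access_units: List[AccessUnit] = field(default_factory=list)


@dataclass
class DatasetGroup:
    datasets: List[Dataset] = field(default_factory=list)


def select(group: DatasetGroup, data_class: str) -> Iterator[AccessUnit]:
    """Yield only the access units of one class, leaving other payloads untouched."""
    for dataset in group.datasets:
        for unit in dataset.access_units:
            if unit.data_class == data_class:
                yield unit


first = Dataset({"name": "individual_1"},
                [AccessUnit("P", b"\x01"), AccessUnit("M", b"\x02")])
second = Dataset({"name": "individual_2"},
                 [AccessUnit("P", b"\x03"), AccessUnit("I", b"\x04")])
group = DatasetGroup([first, second])

perfect = list(select(group, "P"))
```

In this example `perfect` contains the two class P access units, one from each dataset, and the class M and class I payloads are never read.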

A simplified diagram of the dataset decoding process is shown in Figure 3.

Figure 3—Decoding process

The International Organization for Standardization (ISO) and International Electrotechnical Commission (IEC) draw attention to the fact that it is claimed that compliance with this document may involve the use of a patent.

ISO and IEC take no position concerning the evidence, validity and scope of this patent right.

The holder of this patent right has assured ISO and IEC that he/she is willing to negotiate licences under reasonable and non-discriminatory terms and conditions with applicants throughout the world. In this respect, the statement of the holder of this patent right is registered with ISO and IEC. Information may be obtained from the patent database available at www.iso.org/patents.

Attention is drawn to the possibility that some of the elements of this document may be the subject of patent rights other than those in the patent database. ISO and IEC shall not be held responsible for identifying any or all such patent rights.
