ISO/TR 21636-2:2023 Language coding — A framework for language varieties — Part 2: Description of the framework

Introduction

More and more digital language resources (LRs) are being created (also by retro-digitization), archived, processed and analysed. In this context, detailed and exact characterization of language varieties present in a given language use event is quickly gaining importance. Here, language use includes all modalities such as written, spoken or signed, and also new forms of language use supported by digital technology (in social media and similar forms of digital communication). But this is just one way in which languages vary internally. Others include, for instance, the well-known regional (dialectal) and social variation.

While in the past a primary goal of working with LRs was the archiving and preservation of LRs, new goals have emerged and are still emerging:

institutions and individuals need to exchange metadata (that is, bibliographic description data and other secondary information) for making the information on existing LRs widely available in a harmonized form;
researchers are looking for the primary data (that is, the LRs themselves) for many different research purposes, including research on linguistic variation;
researchers and developers need LRs for the development of more advanced language technologies (LTs) and for testing purposes, as LTs, in particular speech recognition and language analysis, are entering more and more dimensions of human communication.

In order to achieve the above-mentioned goals and purposes, along with others not outlined in this document, a standardized set of metadata for the identification of language varieties is important to guarantee frictionless exchange of secondary information. Well-organized metadata also help to indicate the degree of interoperability (equalling re-usability and re-purposability of LRs), and the applicability of LTs to different situations or LRs over time. These metadata are applicable in eBusiness, eHealth, eGovernment, eInclusion, eLearning, smart environments, ambient assisted living (AAL) and virtually all other applications which depend on information about LRs. A clear metadata approach is also a prerequisite for the durability of language resource archiving (in particular in the case of cultural heritage and scientific research data).

The identification of different individual languages is the subject of ISO 639,¹ which identifies existing (living, extinct and historical) individual languages, as well as language groups. This document, and the ISO 21636 series in general, presupposes and complements ISO 639 by extending the language code framework in order to allow for the identification of language varieties of different types (such as geographical, social and modal varieties, among others). The identification of language varieties can then be included in general, library and archival metadata for describing LRs (which can also include technical information, time and location of recording, and similar general information, which are not part of the ISO 21636 series).

The provisions of the ISO 21636 series cover:

a general conceptual framework to deal coherently with language-internal linguistic variation;
general rules for the identification and description of language varieties;
a set of dimensions and open-ended or closed lists of values that can be assigned to each respective dimension;
a set of metadata categories and examples for the respective possible values, grouped according to the most important aspects of the description of events of language use and resulting LRs, related to linguistic variation.
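
As a rough illustration of the third provision (dimensions with open-ended or closed value lists), the following Python sketch models a variety description layered on an ISO 639 language code. It is not taken from the ISO 21636 series: the class names `Dimension` and `VarietyDescription`, the dimension names, and the value lists are all hypothetical, chosen only to mirror the modalities and variety types mentioned in this introduction.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Dimension:
    """One dimension of language-internal variation (hypothetical model)."""

    name: str
    values: frozenset[str] = frozenset()
    closed: bool = False  # closed list: only the listed values are accepted

    def accepts(self, value: str) -> bool:
        # A closed dimension checks membership; an open-ended one
        # accepts any non-empty value.
        return value in self.values if self.closed else bool(value)


@dataclass
class VarietyDescription:
    """Metadata identifying a language variety on top of an ISO 639 code."""

    language_code: str  # ISO 639 identifier, e.g. "deu" for German
    assignments: dict[str, str] = field(default_factory=dict)

    def assign(self, dimension: Dimension, value: str) -> None:
        if not dimension.accepts(value):
            raise ValueError(f"{value!r} is not allowed for {dimension.name!r}")
        self.assignments[dimension.name] = value


# Assumed example dimensions; treating modality as a closed list is an
# illustration, not a statement about the standard.
MODALITY = Dimension("modality", frozenset({"written", "spoken", "signed"}), closed=True)
REGION = Dimension("geographical")  # open-ended list of values

record = VarietyDescription("deu")
record.assign(MODALITY, "spoken")
record.assign(REGION, "Bavaria")
print(record.assignments)
```

Keeping the ISO 639 code separate from the per-dimension assignments reflects the idea that variety identification complements, rather than replaces, the identification of the individual language.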

The metadata categories and values addressed in this document can be candidates for a future highly granular coding of language varieties based on these comprehensive principles. Thus, this document (and the ISO 21636 series in general) conforms to the “recommendations on software and content development principles 2010”, and fits within the general framework of the ISO/IEC 11179 series for metadata.

Stakeholders include, but are not limited to:

information and communication technologies (ICTs) industry (including LTs);
libraries;
the media industry (including entertainment);
internet communities;
people engaging in language documentation and preservation;
language archivists;
translators and interpreters;
researchers (linguists, in particular sociolinguists, ethnologists, sociologists, etc.);
people and institutions providing language training;
emerging new user communities.

It is anticipated that these stakeholders need to refer not only to a certain individual language, but also to a certain language variety, for instance for oral human-computer interaction, or for tailoring a certain LR or tool to the needs and specific environment of a target user group. In order to identify the dimension(s) of linguistic variation internal to individual languages involved, and the respective relevant language varieties, a first step is to achieve the needed specificity. Adopting a conceptually sound, uniform framework of reference as developed in this document is superior to the proliferation of different individual ad hoc solutions.
