ISO/IEC 30107-3:2017 Information technology - Biometric presentation attack detection - Part 3: Testing and reporting

Introduction

The presentation of an artefact or of human characteristics to a biometric capture subsystem in a fashion intended to interfere with system policy is referred to as a presentation attack. ISO/IEC 30107 (all parts) addresses techniques for the automated detection of presentation attacks. These techniques are called presentation attack detection (PAD) mechanisms.

As is the case for biometric recognition, PAD mechanisms are subject to false positive and false negative errors. False positive errors wrongly categorize bona fide presentations as attack presentations, potentially flagging or inconveniencing legitimate users. False negative errors wrongly categorize presentation attacks (also known as attack presentations) as bona fide presentations, potentially resulting in a security breach.
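
As a minimal sketch of how these two error types can be tallied, the following Python computes both rates from a list of labelled presentations. The `Presentation` record and the `pad_error_rates` function are hypothetical names introduced here for illustration only; they are not defined by this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Presentation:
    """One presentation to the capture subsystem (hypothetical record)."""
    is_attack: bool          # ground truth: True for an attack presentation
    classified_attack: bool  # PAD decision: True if flagged as an attack


def pad_error_rates(presentations):
    """Return (false_positive_rate, false_negative_rate).

    False positive: a bona fide presentation classified as an attack.
    False negative: an attack presentation classified as bona fide.
    """
    bona_fide = [p for p in presentations if not p.is_attack]
    attacks = [p for p in presentations if p.is_attack]
    if not bona_fide or not attacks:
        raise ValueError("need at least one bona fide and one attack presentation")

    false_positives = sum(1 for p in bona_fide if p.classified_attack)
    false_negatives = sum(1 for p in attacks if not p.classified_attack)
    return false_positives / len(bona_fide), false_negatives / len(attacks)
```

The two rates are kept separate because they carry different costs: the first inconveniences legitimate users, while the second can result in a security breach.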

Therefore, the decision to use a specific implementation of PAD will depend upon the requirements of the application and consideration of the trade-offs with respect to security, evidence strength, and efficiency.

The purpose of this document is as follows:

to define terms related to biometric presentation attack detection testing and reporting, and
to specify principles and methods of performance assessment of biometric presentation attack detection, including metrics.

This document is directed at vendors or test labs seeking to conduct evaluations of PAD mechanisms.

Biometric performance testing terminology, practices, and methodologies for statistical analysis have been standardized through ISO and Common Criteria. Metrics such as FAR, FRR, and FTE are widely used to characterize biometric system performance. However, these terms, practices, and methodologies are only partially applicable to the evaluation of PAD mechanisms, because biometric performance testing and PAD mechanism testing differ in significant, fundamental ways. These differences can be categorized as follows:

a) Statistical significance

Biometric performance testing utilizes a statistically significant number of test subjects representative of the targeted user group. Error rates are not expected to vary significantly when adding more test subjects or using a completely different group. Generally, taking more measurements increases the accuracy of the error rates.

In PAD testing, many biometric modalities can be attacked by a large or indeterminate number of potential presentation attack instrument (PAI) species. In these cases, it is very difficult or even impossible to have a comprehensive model of all possible presentation attack instruments. Hence, it could be impossible to find a representative set of PAI species for the evaluation. Therefore, measured error rates of one set of presentation attack instruments cannot be assumed to be applicable to a different set.

PAI species present a source of systematic variation in a test. Different PAI species may have significantly different error rates. Additionally, within any given PAI species, there will be random variation across instances of the PAI series. The number of presentations required for a statistically significant test will scale linearly with the number of PAI species of interest. Within each PAI species, the uncertainty associated with a PAD error rate estimate will depend on the number of artefacts tested and the number of individuals.
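
The scaling and uncertainty described above can be illustrated with a short sketch. It uses the Wilson score interval, a standard binomial confidence interval chosen here for illustration and not prescribed by this document. It also assumes that presentations are independent, which is optimistic: as noted above, results depend on the number of artefacts and individuals, so the real uncertainty is likely larger. The function names are hypothetical.

```python
import math


def total_presentations(n_species, presentations_per_species):
    """Test effort grows linearly with the number of PAI species of interest."""
    return n_species * presentations_per_species


def wilson_interval(errors, trials, z=1.96):
    """Wilson score interval for an error rate within one PAI species.

    z=1.96 corresponds to roughly 95 % confidence. Presentations are
    assumed independent, which is a simplifying assumption.
    """
    if trials <= 0 or not 0 <= errors <= trials:
        raise ValueError("require trials > 0 and 0 <= errors <= trials")
    p = errors / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)
```

Under these assumptions, observing zero errors in 100 presentations of one species still leaves an upper bound of a few percent on that species' error rate, and the bound says nothing about any other species.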

EXAMPLE 1

In fingerprint biometrics, many potent artefact materials are known, but any material or material mixture that can present fingerprint features to a biometric sensor is a possible candidate. Since artefact properties such as age, thickness, moisture, temperature, mixture rates, and manufacturing practices can have a significant influence on the output of the PAD mechanism, it is easy to define tens of thousands of PAI species using current materials. Hundreds of thousands of presentations would be needed for a proper statistical analysis – even then, resulting error rates could not be transferred to the next set of new materials.
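
The arithmetic behind this example can be sketched as follows. The number of levels per property and the number of presentations per species are hypothetical values chosen only to show how quickly the combinations multiply; the document gives no such figures.

```python
import math

# Hypothetical number of distinct levels per artefact property.
# The property names come from the example; the counts are illustrative only.
property_levels = {
    "material": 20,
    "age": 3,
    "thickness": 3,
    "moisture": 3,
    "temperature": 3,
    "mixture_rate": 4,
    "manufacturing_practice": 3,
}


def count_pai_species(levels):
    """Each combination of property levels is treated as one PAI species."""
    return math.prod(levels.values())


def presentations_needed(levels, presentations_per_species):
    """Total presentations if every species is presented the same number of times."""
    return count_pai_species(levels) * presentations_per_species
```

With these illustrative counts there are 19 440 species, and even 10 presentations per species would require 194 400 presentations in total.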

b) Comparability of test results across systems

In biometric performance testing, application-specific error rates based on the same corpus of biometric samples can be used to compare different biometric systems or different configurations. The meaning of “better” and “worse” is generally understood.

By contrast, when using error rates to benchmark PAD mechanisms, terms such as “better” can be highly dependent on the intended application.

EXAMPLE 2

In a given testing scenario with 10 PAI species (presented 100 times), System₁ detects 90 % of attack presentations and System₂ detects 85 %. System₁ detects all presentations for 9 PAI species but fails to detect any presentation of the 10th PAI species. System₂ detects 85 % of the presentations of every PAI species. Which is better? In a security analysis, System₁ would be worse than System₂, because an attacker who learned of the 10th PAI species could use it to defeat the capture device every time. However, if attackers could be prevented from using the 10th PAI species, System₁ would be better than System₂, because the individual rates indicate that System₂ can be overcome with any of the PAI species.
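
The comparison in this example can be reproduced with a small sketch. It assumes that each of the 10 PAI species is presented 100 times, which is one reading of the example; the percentages are the same if the 100 presentations are instead spread evenly over the species. The function and variable names are hypothetical.

```python
def detection_summary(detected_per_species, presentations_per_species):
    """Return (overall_rate, weakest_species, weakest_rate).

    detected_per_species maps a PAI species name to the number of its
    attack presentations that the system detected.
    """
    if not detected_per_species:
        raise ValueError("at least one PAI species is required")
    rates = {
        species: detected / presentations_per_species
        for species, detected in detected_per_species.items()
    }
    total = presentations_per_species * len(detected_per_species)
    overall = sum(detected_per_species.values()) / total
    weakest = min(rates, key=rates.get)
    return overall, weakest, rates[weakest]


# System 1: perfect on nine species, blind to the tenth.
system_1 = {f"species_{i}": 100 for i in range(1, 10)}
system_1["species_10"] = 0

# System 2: detects 85 of 100 presentations for every species.
system_2 = {f"species_{i}": 85 for i in range(1, 11)}

summary_1 = detection_summary(system_1, 100)
summary_2 = detection_summary(system_2, 100)
```

The overall rate favours System₁, while the per-species minimum favours System₂. This is why a single aggregate figure cannot rank the two systems without reference to the intended application.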

c) Cooperation

Many biometric performance tests address applications such as access control in which subjects are cooperative. Errors due to incorrect operation stem from a lack of knowledge, experience or guidance rather than from intent. Significant uncooperative behaviour in a group is not part of the underlying “biometric model” and would render the determined error rates almost useless for biometric performance testing.

PAD tests include subjects whose behaviour is not cooperative. Attackers will try to find and exploit any weakness of the biometric system, circumventing or manipulating its intended operation. Presentation attack types, based on the experience and knowledge of the tester, can change the success rates for an attack dramatically. Hence, it can be difficult to define testing procedures that measure error rates in a fashion representative of cooperative behaviour.

d) Automated testing

In biometric performance testing, it is often possible to test comparison algorithms using databases from devices or sensors of similar quality. Performance can be measured in a technology evaluation using previously collected corpora of samples as specified in ISO/IEC 19795-1.

In PAD testing, data from the biometric sensor (e.g. digitized fingerprint images) may be insufficient to conduct evaluations. Biometric systems with PAD mechanisms often contain additional sensors to detect specific properties of a biometric characteristic. Hence, a database previously collected for a specific biometric system or configuration may not be suitable for another biometric system or configuration. Even slight changes in the hardware or software could make earlier measurements useless. It is generally impractical to store multivariate synchronized PAD signals and replay them in automated testing. Therefore, automated testing is often not an option for testing and evaluating PAD mechanisms.

e) Quality and performance

In biometric performance testing, performance is usually linked directly to biometric data quality. Low-quality samples generally result in higher error rates while a test with only high-quality samples will generally result in lower error rates. Hence, quality metrics are often used to improve performance (dependent on the application).

In PAD testing, even though low biometric quality can cause an artefact to be unsuccessful, there is no reason to assume a certain quality level from artefacts in general. Samples from artefacts can exhibit better quality than samples from human biometric characteristics. Absent a model of attacker skill, it seems valid (at least in a security evaluation) to assume a “worst case” scenario where the attacker always uses the best possible quality. That way, one can at least determine a guaranteed minimal detection rate for the specific test set while reducing the number of necessary tests at the same time. It is then a matter of rating the attack potential of successful artefacts (effort and expertise for the needed quality) in order to assess the security level, as is the practice in Common Criteria evaluations.

Based on the differences a) through e), the following general comments regarding error rates and metrics related to PAD mechanisms can be derived:

In an evaluation, PAI species are analysed/rated separately.
Attack presentation classification error rates other than 0 % for a PAI species only prove that the PAI can be successful. A different tester might achieve a higher or lower attack presentation classification error rate. Further, training to identify the relevant material and presentation parameters could increase the attack presentation classification error rate for this PAI species. The experience and knowledge of the tester, as well as the availability of the necessary resources, are significant factors in PAD testing and are taken into account when conducting comparisons or performance analysis.
Error rates for PAD mechanisms are determined by the specific context of the given PAD mechanism, the set of PAI species, the application, the test approach, and the tester. Error rates for PAD mechanisms are not necessarily comparable across similar tests, and error rates for PAD mechanisms are not necessarily reproducible by different test laboratories.
