Many methods of intelligent data analysis exist in the areas of
Knowledge Discovery in Databases (KDD), Data Mining, Machine Learning
and Statistics.
While most of these methods are quite general and applicable in
many very different areas, fraud detection has a few peculiarities
that can cause difficulties when these methods are applied naively.
-
In all areas, very large quantities of data are available, of which
only a very small portion is fraud. In statistical terms, the class
distribution is highly skewed.
-
For each method of fraud, a detection algorithm has to be
found. Each algorithm has to be tuned with patterns and parameters
to recognize fraud cases efficiently.
-
Fraudsters, however, modify their methods, so the detection
system has to be adapted continuously to new fraud patterns.
- A fast response time is necessary in order to minimize the damage. For
the detection of credit card fraud, real-time processing is necessary.
- There are two types of errors in binary classification: false alarms
(false positives) and undiscovered cases (false negatives). See the
following table.
|          | Fraud          | No fraud       |
|----------|----------------|----------------|
| Alarm    | correct        | false positive |
| No alarm | false negative | correct        |
-
Often the alarms need verification by human operators and are put in a
queue, so the costs of the two error classes differ: a false
positive costs operator time, while a false negative causes further
losses. Cost-sensitive methods that respect these
different costs are needed.
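
The effect of unequal error costs on a highly skewed data set can be illustrated with a minimal sketch. The cost figures (5 per false positive, 500 per false negative), the sample data and all function names are assumptions chosen for illustration only. The sketch assumes a detector that outputs a fraud probability and raises an alarm when the expected cost of staying silent exceeds the expected cost of alarming.

```python
def alarm_threshold(cost_fp: float, cost_fn: float) -> float:
    """Fraud probability above which an alarm has the lower expected cost.

    Alarm if p * cost_fn > (1 - p) * cost_fp, i.e. p > cost_fp / (cost_fp + cost_fn).
    """
    return cost_fp / (cost_fp + cost_fn)


def total_cost(y_true, y_alarm, cost_fp: float, cost_fn: float) -> float:
    """Sum the costs of false positives and false negatives."""
    fp = sum(1 for t, a in zip(y_true, y_alarm) if a and not t)
    fn = sum(1 for t, a in zip(y_true, y_alarm) if t and not a)
    return fp * cost_fp + fn * cost_fn


def accuracy(y_true, y_alarm) -> float:
    """Fraction of cases where the alarm decision matches the true class."""
    hits = sum(1 for t, a in zip(y_true, y_alarm) if bool(t) == bool(a))
    return hits / len(y_true)


# Hypothetical data: 1000 transactions, 2 of them fraudulent.
y_true = [1, 1] + [0] * 998
never_alarm = [0] * 1000                     # detector that never raises an alarm
flag_20 = [1, 1] + [1] * 18 + [0] * 980      # catches both frauds, 18 false alarms

COST_FP, COST_FN = 5.0, 500.0                # assumed costs, not from the text

for name, alarms in [("never alarm", never_alarm), ("flag 20", flag_20)]:
    print(name, accuracy(y_true, alarms), total_cost(y_true, alarms, COST_FP, COST_FN))
```

In this made-up example the detector that never raises an alarm has the higher accuracy but the higher total cost, which is why accuracy alone is misleading on skewed data with unequal costs.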
-
The continuously changing and skewed distributions and the need for
cost-sensitive methods complicate the evaluation of the performance of
fraud detection methods.
Even for "standard" classification methods, evaluating performance
can be difficult [Sal97], and
standard performance measures such as error rate, accuracy and ROC
curves are not well suited to fraud detection
[CCLPS00, PFK98, PF01].
A technique better suited to evaluating fraud detection methods is the
ROC Convex Hull method [PF01].
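
The idea behind the ROC Convex Hull can be sketched as follows. This is a minimal illustration, not the implementation from [PF01]: it assumes each classifier is summarized by one (false positive rate, true positive rate) point, and the function names, sample points, class priors and costs are all hypothetical. Only classifiers on the upper convex hull can be optimal for some combination of class distribution and error costs, and the best point on the hull changes as those conditions change.

```python
def _cross(o, a, b):
    """Cross product of vectors o->a and o->b (positive for a left turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def roc_convex_hull(points):
    """Upper convex hull of ROC points (fpr, tpr), from (0, 0) to (1, 1)."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for p in pts:
        # Drop the last hull point while it lies on or below the line
        # from the point before it to the new point p.
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull


def best_operating_point(hull, p_fraud, cost_fp, cost_fn):
    """Hull point with the lowest expected cost per case."""
    def expected_cost(point):
        fpr, tpr = point
        return p_fraud * (1 - tpr) * cost_fn + (1 - p_fraud) * fpr * cost_fp
    return min(hull, key=expected_cost)


# Hypothetical classifiers, each given as (false positive rate, true positive rate).
classifiers = [(0.1, 0.6), (0.3, 0.5), (0.4, 0.9)]
hull = roc_convex_hull(classifiers)
print(hull)  # (0.3, 0.5) lies below the hull and is dropped

# The preferred classifier depends on the fraud rate (costs are assumed values).
print(best_operating_point(hull, p_fraud=0.002, cost_fp=5, cost_fn=500))
print(best_operating_point(hull, p_fraud=0.05, cost_fp=5, cost_fn=500))
```

With a fraud rate of 0.2 % the conservative classifier at (0.1, 0.6) is cheapest in this example, while at 5 % the more aggressive one at (0.4, 0.9) wins, so a single fixed performance number cannot rank the classifiers once the distribution drifts.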
-
In the traditional data model of databases, i.e. persistent relations, the standard steps of
data processing are "First load the data, then index it, then run queries".
The steps of loading and indexing are often very time-consuming and make real-time processing
difficult or impossible. In some domains there is simply too much data for standard
relational databases.
So a new data model for data-intensive applications was developed:
continuous data streams.
This is still a research area, but some prototype data stream management systems,
stream processing engines and extensions of SQL into a Continuous Query Language (CQL)
have been developed.
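
The difference between "load, index, query" and a continuous query can be illustrated with a small sketch of a time-based sliding window. It does not use any particular data stream management system; the rule (more than three transactions per card within 60 seconds), the event layout and the function name are assumptions made up for this example. Events are assumed to arrive in timestamp order.

```python
from collections import defaultdict, deque


def velocity_alerts(stream, window_seconds=60.0, max_count=3):
    """Yield (timestamp, card_id, count) whenever a card exceeds max_count
    transactions within the last window_seconds.

    Each event is processed once as it arrives; only the timestamps that
    still fall inside the window are kept in memory.
    """
    recent = defaultdict(deque)  # card_id -> timestamps inside the window
    for timestamp, card_id, _amount in stream:
        window = recent[card_id]
        window.append(timestamp)
        # Evict timestamps that have slid out of the window.
        while window and window[0] <= timestamp - window_seconds:
            window.popleft()
        if len(window) > max_count:
            yield (timestamp, card_id, len(window))


# Hypothetical events: (timestamp in seconds, card id, amount).
events = [
    (0, "A", 10.0),
    (10, "A", 12.0),
    (15, "B", 99.0),
    (20, "A", 8.0),
    (30, "A", 500.0),
    (100, "A", 5.0),
]

for alert in velocity_alerts(events):
    print(alert)
```

Because the function is a generator, an alert is produced as soon as the triggering event is read, without first storing and indexing the whole data set.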