Data Mining and Discovering Knowledge in your Data
The continuing rapid growth of on-line data and the
widespread use of databases necessitate the development of techniques
for extracting useful knowledge and for facilitating database access.
The challenge of extracting knowledge from data is of common interest
to several fields, including statistics, databases, pattern recognition,
machine learning, data visualization, optimization, and high-performance
computing.
Data mining is the process of extracting knowledge
from data. The combination of fast computers, cheap storage,
and better communication makes it easier by the day to tease useful information
out of everything from supermarket buying patterns to credit histories.
For clever marketeers, that knowledge can be worth as much as the stuff
real miners dig from the ground.
Data mining is an analytic process designed to
explore large amounts of (typically business or market related) data and
search for consistent patterns and/or systematic relationships between
variables, and then to validate the findings by applying the
detected patterns to new subsets of data. The process thus consists of
three basic stages: exploration, model building or pattern definition,
and validation/verification.
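The three stages can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the data are synthetic (generated here, not taken from any real source), the "pattern" is a simple linear relationship, and the helper names `correlation`, `fit_line`, and `mean_abs_error` are hypothetical.

```python
import random

def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / (sxx * syy) ** 0.5

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def mean_abs_error(xs, ys, a, b):
    """Average absolute prediction error of the fitted line."""
    return sum(abs(y - (a + b * x)) for x, y in zip(xs, ys)) / len(xs)

# Synthetic, made-up data: y depends linearly on x plus noise.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [2.0 + 3.0 * x + rng.gauss(0, 1) for x in xs]

# Hold back a subset that the model never sees during fitting.
train_x, test_x = xs[:150], xs[150:]
train_y, test_y = ys[:150], ys[150:]

r = correlation(train_x, train_y)                    # 1. exploration
a, b = fit_line(train_x, train_y)                    # 2. model building
holdout_mae = mean_abs_error(test_x, test_y, a, b)   # 3. validation on new data
```

The key design point is the held-out subset: the pattern is only trusted if it still predicts well on data that played no part in finding it.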
What distinguishes data mining from conventional statistical
data analysis is that data mining is usually done for the purpose of "secondary
analysis" aimed at finding unsuspected relationships unrelated to
the purposes for which the data were originally collected.
Data warehousing is a process of organizing the storage
of large, multivariate data sets in a way that facilitates the retrieval
of information for analytic purposes.
Data mining is now a rather vague term, but the element
that is common to most definitions is "predictive modeling with large
data sets as used by big companies". In this sense, data mining is the
extraction of hidden predictive information from large databases. It is
a powerful new technology with great potential, for example, to help marketing
managers "preemptively define the information market of tomorrow."
Data mining tools predict future trends and behaviors, allowing businesses
to make proactive, knowledge-driven decisions. The automated, prospective
analyses offered by data mining move beyond the analyses of past events
provided by retrospective tools. Data mining answers business questions
that traditionally were too time-consuming to resolve. Data mining tools
scour databases for hidden patterns, finding predictive information that
experts may miss because it lies outside their expectations.
Data mining techniques can be implemented rapidly on
existing software and hardware platforms across large companies to
enhance the value of existing resources, and can be integrated with new
products and systems as they are brought on-line. When implemented on
high performance client-server or parallel processing computers, data
mining tools can analyze massive databases while a customer or analyst
takes a coffee break, then deliver answers to questions such as, "Which
clients are most likely to respond to my next promotional mailing, and
why?"
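As a rough illustration of how such a question might be approached, the sketch below ranks client segments by their response rate in a past mailing. Everything here is an assumption made for the example: the records are invented, the segment names are hypothetical, and a real tool would use a far richer predictive model than a per-segment average.

```python
from collections import defaultdict

# Made-up history of one past mailing: (client_id, segment, responded)
history = [
    ("c1", "recent_buyer", True),
    ("c2", "recent_buyer", True),
    ("c3", "recent_buyer", False),
    ("c4", "lapsed", False),
    ("c5", "lapsed", False),
    ("c6", "lapsed", True),
    ("c7", "new_signup", False),
    ("c8", "new_signup", False),
]

def response_rates(records):
    """Fraction of mailed clients in each segment who responded."""
    sent = defaultdict(int)
    hits = defaultdict(int)
    for _client, segment, responded in records:
        sent[segment] += 1
        hits[segment] += int(responded)
    return {seg: hits[seg] / sent[seg] for seg in sent}

rates = response_rates(history)

# Segments ordered from most to least responsive: the "which clients" part.
# The segment label itself is a crude answer to the "why" part.
ranked = sorted(rates, key=rates.get, reverse=True)
```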
Knowledge discovery in databases aims at tearing down
the last barrier in enterprises' information flow, the data analysis step.
It is a label for an activity performed in a wide variety of application
domains within the science and business communities, as well as for pleasure.
The activity uses a large and heterogeneous data-set as a basis for synthesizing
new and relevant knowledge. The knowledge is new because hidden relationships
within the data are explicated, and/or data is combined with prior knowledge
to elucidate a given problem. The term relevant is used to emphasize that
knowledge discovery is a goal-driven process in which knowledge is constructed
to facilitate the solution to a problem.
Knowledge discovery may be viewed as a process containing
many tasks. Some of these tasks are well understood, while others depend
on human judgment in an implicit manner. Further, the process is characterized
by heavy iteration between the tasks. This is very similar to many creative
engineering processes, e.g., the development of dynamic models. In this
reference, mechanistic (first-principles-based) models are emphasized,
and the tasks involved in model development are defined as follows:
1. Initial data collection and problem formulation. The
initial data are collected, and some more or less precise formulation
of the modeling problem is developed.
2. Tools selection. The software tools to support modeling
and allow simulation are selected.
3. Conceptual modeling. The system to be modeled, e.g.,
a chemical reactor, a power generator, or a marine vessel, is first
abstracted. The essential compartments and the dominant phenomena occurring
are identified and documented for later reuse.
4. Model representation. A representation of the system
model is generated. Often, equations are used; however, a graphical block
diagram (or any other formalism) may alternatively be used, depending
on the modeling tools selected above.
5. Implementation. The model representation is implemented
using the means provided by the modeling system of the software employed.
These may range from general programming languages to equation-based modeling
languages or graphical block-oriented interfaces.
6. Verification. The model implementation is checked to confirm
that it captures the intent of the modeler. No simulations for the actual
problem to be solved are carried out for this purpose.
7. Initialization. Reasonable initial values are provided
or computed, and the numerical solution process is debugged.
8. Validation. The results of the simulation are validated
against some reference, ideally against experimental data.
9. Documentation. The modeling process, the model, and
the simulation results during validation and application of the model
are documented.
10. Model application. The model is used in some model-based
process engineering problem-solving task.
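Steps 4 to 6 above (representation, implementation, verification) can be illustrated with a deliberately small example. The sketch assumes a single liquid tank described by dh/dt = (q_in - k*h) / area, which is not taken from the reference; the parameter values and the function names `simulate_tank` and `analytic_level` are hypothetical. Verification here means comparing the implementation against a case with a known closed-form solution, not running the actual problem.

```python
import math

def simulate_tank(h0, q_in, k, area, dt, steps):
    """Explicit Euler integration of dh/dt = (q_in - k*h) / area."""
    h = h0
    trajectory = [h]
    for _ in range(steps):
        h += dt * (q_in - k * h) / area
        trajectory.append(h)
    return trajectory

def analytic_level(t, h0, q_in, k, area):
    """Closed-form solution of the same linear equation, used as reference."""
    h_ss = q_in / k  # steady-state level
    return h_ss + (h0 - h_ss) * math.exp(-k * t / area)

# Hypothetical parameters and initial value (step 7 would choose these with care).
h0, q_in, k, area = 0.0, 2.0, 0.5, 1.0
dt, steps = 0.01, 2000

levels = simulate_tank(h0, q_in, k, area, dt, steps)

# Verification: largest deviation between the implementation and the
# analytic solution over the whole simulated horizon.
max_error = max(
    abs(h - analytic_level(i * dt, h0, q_in, k, area))
    for i, h in enumerate(levels)
)
```

Validation (step 8) would be a separate activity: comparing `levels` against measurements from a real tank rather than against the equation itself.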
For other model types, like neural network models where
data-driven knowledge is utilized, the modeling process will be somewhat
different. Some of the tasks, like the conceptual modeling phase, will
vanish.
Typical application areas for dynamic models are control,
prediction, planning, and fault detection and diagnosis. A major deficiency
of today's methods is the lack of ability to utilize a wide variety of
knowledge. As an example, a black-box model structure has very limited
abilities to utilize first-principles knowledge about a problem. This has
provided a basis for developing different hybrid schemes. Two hybrid schemes
are highlighted in the discussion. First, it will be shown how a mechanistic
model can be combined with a black-box model to represent a pH neutralization
system efficiently. Second, the combination of continuous and discrete
control inputs is considered, using a two-tank example as a case. Different
approaches to handling this heterogeneous case are considered.
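One common way to combine a mechanistic model with a black-box model is to let the black-box part learn only the residual that the mechanistic part fails to explain. The sketch below shows that idea on a toy problem; it is not the pH neutralization scheme from the reference. The mechanistic law `2.0 * x`, the unmodeled quadratic effect, and the nearest-neighbour residual model are all assumptions chosen for brevity.

```python
def mechanistic(x):
    """Assumed first-principles part of the model (hypothetical law)."""
    return 2.0 * x

def fit_residual_model(xs, ys, k=3):
    """Black-box part: k-nearest-neighbour average of the residuals
    left over after the mechanistic prediction is subtracted."""
    residuals = [(x, y - mechanistic(x)) for x, y in zip(xs, ys)]

    def predict_residual(x_new):
        nearest = sorted(residuals, key=lambda p: abs(p[0] - x_new))[:k]
        return sum(r for _, r in nearest) / len(nearest)

    return predict_residual

def hybrid(x, residual_model):
    """Hybrid prediction = mechanistic prediction + learned correction."""
    return mechanistic(x) + residual_model(x)

# Made-up "plant" data containing an effect the mechanistic part ignores.
xs = [i * 0.1 for i in range(51)]
ys = [2.0 * x + 0.5 * x * x for x in xs]

residual_model = fit_residual_model(xs, ys)

x_test = 3.05
truth = 2.0 * x_test + 0.5 * x_test * x_test
mech_error = abs(mechanistic(x_test) - truth)
hybrid_error = abs(hybrid(x_test, residual_model) - truth)
```

In this arrangement the first-principles knowledge carries the bulk of the prediction, and the data-driven part only has to capture what that knowledge misses.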
The hybrid approach may be viewed as a means to integrate
different types of knowledge, i.e., being able to utilize a heterogeneous
knowledge base to derive a model. Standard practice today is that methods
and software can treat large homogeneous data-sets. A typical example
of a homogeneous data-set is time-series data from some system, e.g.,
temperature, pressure, and composition measurements over some time frame
provided by the instrumentation and control system of a chemical reactor.
If textual information of a qualitative nature is provided by plant personnel,
the data becomes heterogeneous.
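The distinction can be made concrete with a small data-structure sketch. The class and field names below (`Measurement`, `OperatorNote`, `PlantRecord`) and the sample values are hypothetical, chosen only to show numeric time-series samples sitting alongside free-text remarks.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Measurement:
    """One homogeneous time-series sample with a fixed numeric schema."""
    time_s: float
    temperature_c: float
    pressure_bar: float

@dataclass
class OperatorNote:
    """Qualitative free-text remark from plant personnel."""
    time_s: float
    text: str

@dataclass
class PlantRecord:
    measurements: List[Measurement] = field(default_factory=list)
    notes: List[OperatorNote] = field(default_factory=list)

    def is_heterogeneous(self) -> bool:
        # Numeric samples alone are homogeneous; adding text makes the set mixed.
        return bool(self.measurements) and bool(self.notes)

    def notes_between(self, start_s: float, end_s: float) -> List[OperatorNote]:
        """Notes whose timestamp falls inside a measurement window."""
        return [n for n in self.notes if start_s <= n.time_s <= end_s]

# Invented sample values, for illustration only.
record = PlantRecord(
    measurements=[Measurement(0.0, 351.2, 2.1), Measurement(60.0, 352.0, 2.1)]
)
homogeneous_before = record.is_heterogeneous()  # only numeric data so far
record.notes.append(OperatorNote(45.0, "Foaming observed near the feed inlet."))
heterogeneous_after = record.is_heterogeneous()
```

Keeping a timestamp on both kinds of record is what allows a textual observation to be lined up with the numeric samples taken around the same time.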