DATA MINING - CONCEPTS AND TECHNIQUES
INTRODUCTION
Data mining is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cut costs, or both. It allows users to analyze data from many different dimensions or angles, categorize it and summarize the relationships identified. It is the process of finding correlations or patterns among dozens of fields in large relational databases.
DATA, INFORMATION AND KNOWLEDGE
Data: Data are any facts, numbers, or text that can be processed by a computer. Organizations are accumulating large amounts of data in different formats and different databases. There are three types of data:
Operational or transactional data: This includes sales, cost, inventory, payroll, and accounting.
Non-operational data: Industry sales data, forecast data, and macroeconomic data are considered non-operational.
Meta data: Data about the data itself. This includes logical database design or data dictionary definitions.
Information: Patterns, associations, or relationships among all of the above types of data provide information.
Knowledge: Information can be converted into knowledge about historical patterns and future trends. For example, a manufacturer or retailer could determine which items are most susceptible to promotional efforts.
FOUNDATIONS OF DATA MINING
Data mining techniques are the result of a long process of research and product development. This evolution began when business data was first stored on computers, continued with improvements in data access, and more recently, generated technologies that allow users to navigate through their data in real time. Data mining takes this evolutionary process beyond retrospective data access and navigation to prospective and proactive information delivery. Data mining is ready for application in the business community because it is supported by three technologies that are now sufficiently mature:
· Massive data collection
· Powerful multiprocessor computers
· Data mining algorithms
ROLE OF DATA MINING
Data mining is used today primarily by companies with a strong consumer focus: retail, financial, communication, and marketing organizations. It enables these companies to analyze the relationships among "internal" factors such as price, product positioning, and staff skills, and "external" factors such as economic indicators, technology, competition, and customer demographics, in order to determine their impact on sales, customer satisfaction, and corporate profits.
Data mining helps a retailer use point-of-sale records of customer purchases to send targeted promotions based on an individual's purchase history. By mining demographic data from comment or warranty cards, the retailer can develop products and promotions that appeal to specific customer segments. Data mining software analyzes relationships and patterns in stored transaction data based on user queries. Several types of analytical software are available for this purpose: statistical, machine learning, and neural networks. Four types of relationships are generally sought:
Classes: Stored data is used to locate data in predetermined groups. For example, a restaurant chain could mine customer purchase data to determine when customers visit and what they typically order. This information can be used to increase traffic by offering daily specials.
Clusters: Grouping of data items according to logical relationships or consumer preferences.
Associations: Data can be mined to identify associations between items. The commonly cited example is beer and diapers, two products found to be purchased together.
Sequential patterns: Data mining is used to determine behaviour patterns and trends. For example, an outdoor equipment retailer can predict the likelihood of a backpack being purchased based on consumers’ purchase of sleeping bags and hiking shoes.
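The association relationship described above can be made concrete with two standard measures, support and confidence. The following is a minimal sketch, not a production algorithm: the transactions are invented for illustration, and the function names (count_containing, support, confidence, most_frequent_pair) are hypothetical helpers rather than part of any library.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket data: each set is one customer's transaction.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"beer", "diapers", "milk"},
]


def count_containing(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Fraction of all transactions that contain the itemset."""
    return count_containing(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the transactions containing antecedent, the share that also contain consequent."""
    both = set(antecedent) | set(consequent)
    return count_containing(both, transactions) / count_containing(antecedent, transactions)


def most_frequent_pair(transactions):
    """Item pair that appears together in the largest number of transactions."""
    counts = Counter()
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            counts[pair] += 1
    return counts.most_common(1)[0]


if __name__ == "__main__":
    print(most_frequent_pair(transactions))
    print(support({"beer", "diapers"}, transactions))
    print(confidence({"beer"}, {"diapers"}, transactions))
```

In this toy data set, beer and diapers appear together in three of the five transactions, and three of the four baskets that contain beer also contain diapers. Real association mining tools apply the same measures to far larger data sets and prune the candidate item combinations rather than enumerating all of them.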
MAJOR ELEMENTS OF DATA MINING
· Extract, transform, and load transaction data
· Store and manage the data
· Provide data access
· Analyze the data with application software
· Present the data
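The five elements above can be sketched end to end on a very small scale. The example below is only an illustration under stated assumptions: the CSV content, the table name "sales", and the function names are all invented, and an in-memory SQLite database stands in for a real data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical point-of-sale export; a real system would read from operational sources.
RAW_CSV = """store,item,quantity,unit_price
north,backpack,2,40.00
north,sleeping bag,1,60.00
south,backpack,1,40.00
south,hiking shoes,3,80.00
"""


def extract(text):
    """Extract: parse the raw CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: convert types and derive a revenue figure per row."""
    return [
        (r["store"], r["item"], int(r["quantity"]) * float(r["unit_price"]))
        for r in rows
    ]


def load(conn, records):
    """Load and store: write the transformed records into a database table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (store TEXT, item TEXT, revenue REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", records)
    conn.commit()


def revenue_by_store(conn):
    """Access and analyze: total revenue per store via a SQL query."""
    cur = conn.execute(
        "SELECT store, SUM(revenue) FROM sales GROUP BY store ORDER BY store"
    )
    return dict(cur.fetchall())


def present(summary):
    """Present: format the result as a small text table."""
    return "\n".join(f"{store:<6} {total:>8.2f}" for store, total in summary.items())


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))
summary = revenue_by_store(conn)
report = present(summary)

if __name__ == "__main__":
    print(report)
```

Each function corresponds to one or two of the listed elements, which keeps the stages separable: the storage layer or the presentation format could be replaced without touching the others.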
DATA MINING TECHNIQUES
The most commonly used techniques in data mining are:
Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi-Square Automatic Interaction Detection (CHAID).
Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of evolution.
Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset.
Rule induction: The extraction of useful if-then rules from data based on statistical significance.
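Of the techniques listed above, the nearest neighbor method is the simplest to show in a few lines. The sketch below assumes Euclidean distance and a majority vote among the k closest records; the customer records, the feature choice (age, income in thousands), and the function name knn_classify are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def knn_classify(query, history, k=3):
    """Classify query by majority vote among the k most similar historical records.

    history is a list of (features, label) pairs; similarity is Euclidean distance.
    """
    neighbors = sorted(history, key=lambda rec: math.dist(rec[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical historical dataset: (age, income in thousands) -> responded to promotion?
history = [
    ((25, 30), "no"),
    ((30, 35), "no"),
    ((28, 32), "no"),
    ((45, 80), "yes"),
    ((50, 90), "yes"),
    ((48, 85), "yes"),
]

if __name__ == "__main__":
    print(knn_classify((47, 82), history, k=3))
    print(knn_classify((27, 31), history, k=3))
```

In practice the features would normally be scaled to comparable ranges before distances are computed, since otherwise the attribute with the largest numeric range dominates the similarity measure.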
INFRASTRUCTURE REQUIRED
Today, data mining applications are available on systems of all sizes, for mainframe, client/server, and PC platforms. System prices range from several thousand dollars for the smallest applications up to $1 million a terabyte for the largest. Enterprise-wide applications generally range in size from 10 gigabytes to over 11 terabytes. Two factors determine how powerful a system is required:
Size of the database: the more data being processed and maintained, the more powerful the system required.
Query complexity: the more complex the queries and the greater the number of queries being processed, the more powerful the system required.
CONCLUSION
Most companies already collect and refine massive quantities of data. Data mining techniques can be implemented rapidly on existing software and hardware platforms to enhance the value of existing information resources, and can be integrated with new products and systems as they are brought on-line.