Abstract— Web mining has attracted growing attention from users because of its
accessible interfaces and the large quantity of knowledge now available. Users
can search vast amounts of useful data, but extraction is still limited for
some kinds of resources, such as unlabeled photos. This paper presents a
framework for automated face identification that takes advantage of
content-based image retrieval (CBIR) and search-based image retrieval (SBIR)
techniques to mine the large number of poorly labeled images on the internet.
Because the images are poorly labeled, similar images are difficult to
identify; to address this, we propose an updated unsupervised label refinement
(ULR) approach. A search can be based on the name of an image or on the image
itself: if a match is found in the database, the similar images are displayed;
otherwise the output is null. Cluster analysis is used to group similar images.
In addition, using the concept of association analysis, a count is maintained
for each image based on the number of times it is searched.

 

Keywords- Face annotation, Web mining, Face detection, Indexing, Association
analysis, Cluster analysis.

__________________________________**********________________________________

I. INTRODUCTION

 

Data mining has become increasingly important because of the large amount of
data now available and the need to turn such data into useful information and
knowledge. Extracting, or mining, knowledge from a huge collection of data is
called data mining. Its main goal is to extract information from a data set
and transform it into an understandable structure for future use. Data mining
is one step in the knowledge discovery process, which consists of the following
sequence of steps: data cleaning, data integration, data selection, data
transformation, data mining, pattern evaluation, and knowledge presentation.
The data mining step applies techniques that extract data patterns. A data
mining system has an engine comprising a set of functions for tasks such as
characterization, association and correlation analysis, classification, and
prediction.

Web mining has attracted growing attention from users because of its
accessible interfaces and the large quantity of knowledge now available.
Extracting the patterns that users access in a distributed information
environment is called web mining. A web search based on a single keyword may
return hundreds of links to web pages containing the keyword, but most of
those links will be only weakly related to what the user wants to find.

Extracting frequent patterns leads to the discovery of interesting
associations. Frequent patterns are patterns that occur frequently in a data
set. Market basket analysis is a typical application of frequent itemset
mining. Association analysis is a method for finding interesting relationships
hidden in large amounts of data. It is used to uncover relationships among
related data in a database, a relational database, or another information
repository. Association rules describe the relationships between objects that
are frequently used together. Applications of association rules include basket
data analysis, classification, cross-marketing, clustering, catalog design,
and loss-leader analysis. In this paper the items are images, and related
images are displayed based either on content (the image itself) or on the
image name.

For example, if a customer buys rice, then he may also buy dhal; if a customer
buys a mobile phone, then he may also buy a memory card. Association rules use
two measures, support and confidence, which are obtained by analyzing the data
for frequently occurring patterns. Support is the fraction of transactions
that contain all items of the rule, and confidence is the fraction of
transactions containing the antecedent that also contain the consequent.
Association rules are usually required to satisfy a user-specified minimum
support and a user-specified minimum confidence at the same time.
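The two measures can be illustrated with a minimal sketch. The transactions
below are hypothetical toy data built around the rice/dhal and mobile/memory
card examples, and the function names are chosen here for illustration only.

```python
# Toy transactions (hypothetical data, not taken from the paper).
transactions = [
    {"rice", "dhal", "oil"},
    {"rice", "dhal"},
    {"rice", "sugar"},
    {"mobile", "memory card"},
    {"mobile", "memory card", "rice"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0  # the antecedent never occurs, so the rule has no evidence
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / base

def rule_holds(antecedent, consequent, transactions,
               min_support, min_confidence):
    """A rule is kept only if it meets both user-specified thresholds."""
    both = set(antecedent) | set(consequent)
    return (support(both, transactions) >= min_support
            and confidence(antecedent, consequent, transactions)
            >= min_confidence)
```

With this toy data, the rule rice => dhal has support 0.4 and confidence 0.5,
so it would be rejected under a minimum confidence of 0.6, whereas
mobile => memory card has support 0.4 and confidence 1.0.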

         

 

With the widespread use of digital cameras and the rapid growth of social
media for internet-based photo sharing, recent years have witnessed an
explosion in the number of digital photos captured and stored by users. A major
issue that must be addressed is image recognition, that is, identifying or
verifying images against a database in which images are stored. Image
recognition is an important part of the capability of the human perception
system. Early work on image recognition can be traced back at least to the
1950s in psychology and to the 1960s in the engineering literature. Some of
the earliest studies include work on the facial expression of emotions by
Darwin.

 

Later, additional information such as identification number, race, age,
gender, facial expression, or speech was used to narrow the search (enhancing
recognition). A solution to the problem involves segmentation of faces (face
detection) from cluttered scenes, feature extraction from the face regions,
recognition, and verification; indexing may also be applied to the images. In
identification problems, the input to the system may be given as an image or
as the name of an image, and the system outputs the similar images from a
database of known individuals, or else outputs null. In verification problems,
the system needs to confirm or reject the identity of the input image.

 

In most cases, photos shared by users on the web are facial images. Some
facial images are labeled with names, some are weakly labeled, and some are
not labeled properly. This motivates an important technique: labeling facial
images automatically. It can be useful to many web applications; for example,
online photo-sharing sites can automatically label user-uploaded photos to
support online photo search. A method is presented for labeling a facial image
by mining the web, where a huge number of weakly labeled images are freely
available. It addresses the automated face annotation (identification) task by
taking advantage of content-based image retrieval (CBIR) and search-based
image retrieval (SBIR) techniques to mine the large number of poorly labeled
images on the internet. The framework is model-free and data-driven. Its main
aim is to assign correct name labels to a given query image. Given a novel
facial image for annotation, we first retrieve a short list of the top n most
similar facial images from a poorly labeled facial image database, comparing
images by their pixel values in binary form, and then annotate the facial
image with the names (labels) associated with those top n images.
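The retrieve-then-annotate idea can be sketched as follows. This is only an
illustration under stated assumptions: each face is represented by a short
binary code, similarity is measured by Hamming distance, and the final name is
chosen by a simple majority vote over the top n labels. The paper does not
specify the distance or the voting rule, and the names `hamming`, `annotate`,
and `database` are hypothetical.

```python
from collections import Counter

def hamming(a, b):
    """Number of differing positions between two equal-length bit strings."""
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    return sum(x != y for x, y in zip(a, b))

def annotate(query_code, database, n=3):
    """Return the most common label among the n nearest database entries.

    `database` is a list of (binary_code, weak_label) pairs. Returns None
    when the database is empty.
    """
    if not database:
        return None
    # Rank all stored faces by distance to the query (smallest first).
    ranked = sorted(database, key=lambda item: hamming(query_code, item[0]))
    top_labels = [label for _, label in ranked[:n]]
    # Majority vote over the weak labels of the short list.
    return Counter(top_labels).most_common(1)[0][0]

# Hypothetical weakly labeled database; one label is deliberately noisy.
database = [
    ("11110000", "alice"),
    ("11100000", "alice"),
    ("11110001", "bob"),    # noisy label on a face close to alice's
    ("00001111", "bob"),
    ("00011111", "bob"),
]
```

In this toy database the query code "11110000" is annotated as "alice" even
though one of its three nearest neighbors carries a wrong label, which is the
situation that label refinement is meant to handle at scale.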

 

One challenge faced by CBIR and SBIR techniques is how to effectively identify
and shortlist similar facial images and their weak labels for the face name
annotation task. To address this, we use an updated unsupervised label
refinement (ULR) scheme based on machine learning techniques. We also propose
a cluster-based approximation (CBA) algorithm and an association rule-based
approximation (ABA) algorithm to improve efficiency. The system also lets
users search for similar images by giving an image as input.

 

II. LITERATURE SURVEY

 

Dayong Wang, Steven C.H. Hoi, Ying He, and Jianke Zhu, in “Mining Weakly
Labeled Web Facial Images for Search-Based Face Annotation,” propose a
framework of search-based face annotation (SBFA) that mines weakly labeled
facial images freely available on the World Wide Web (WWW). The framework
exploits the list of most similar facial images and their noisy labels, and
uses an unsupervised label refinement (ULR) approach to refine the labels of
web facial images with machine learning techniques.

 

Zhong Wu, Qifa Ke, Jian Sun, and Heung-Yeung Shum, in “Scalable Face Image
Retrieval with Identity-Based Quantization and Multi-Reference Re-ranking,”
aim to build a scalable face image retrieval system and develop a new scalable
face representation using both local and global features. In the indexing
stage, they exploit special properties of faces to design new component-based
local features, which are subsequently quantized into visual words using a
novel identity-based quantization scheme.

 

Preeti Chouhan and Mukesh Tiwari, in “Feature Extraction Techniques for Image
Retrieval Using Data Mining and Image Processing Techniques,” provide an
introductory review of the application fields of data mining, which span
manufacturing, telecommunication, education, fraud detection, and marketing.
The review covers methods such as clustering, correlation, association, and
neural networks, and also introduces concepts of image mining. Image mining
deals with associations in image data and the extraction of hidden data.

C. Ganesh, B. Sathiyabhama, and T. Geetha, in “Fast Frequent Pattern Mining
Using Vertical Data Format for Knowledge Discovery,” discuss Apriori-based
techniques, Frequent Pattern growth (FP-growth), and Equivalence CLASS
Transformation (ECLAT), which are widely used approaches for extracting
frequent patterns. They also carry out a quantitative investigation of
changing the data format to obtain better results in less computational time.

 

II. METHODOLOGY

 

A. Existing System

In the existing system, object recognition techniques are used to train
classification models from human-tagged training images, or to infer the
correlation between annotated keywords and images. Given limited training
data, semi-supervised learning methods have been used for image identification
with classical classification models.

Limitations:

1. Clear similar images were not displayed using the local binary system.

2. Poor-quality or poorly labeled images are difficult to identify.

3. It always produces approximate results.

4. There was no ranking (count) scheme.

 

B. Proposed System

This paper presents a framework for search-based and content-based image
retrieval that mines the weakly named images available on the web. Because the
images are poorly labeled, similar images are difficult to identify; to
address this, we propose an updated unsupervised label refinement (ULR)
approach. To search the images, the ULR algorithm operates on the binary
format of the images. A search can be based on the name of an image or on the
image itself: if a match is found in the database, the similar images are
displayed; otherwise the output is null. Images are grouped using cluster
approximation. In addition, a count is maintained for each image, based on
user clicks on the images returned by a search.

Advantages:

1. Clear similar images are retrieved based on the image itself or on the name
of the image.

2. Images are easy to retrieve because names are assigned to them.

3. It produces more accurate results than the existing system.

4. There is a ranking (count) scheme based on the number of times users
search for a particular image.

The process has four main modules:

1.      Labeling Images: Images are uploaded together with a label (name).

2.      Content-Based Image Retrieval: The input is an image, and the output is
a group of images similar to the input image, or null if there is none. The
query by image content (QBIC) method is used.

3.      Search-Based Image Retrieval: The input is a name, and the output is a
group of images whose names match the input name, or null if there is none.
The query by image name (QBNC) method is used.

4.      Ranking Scheme: The number of times each image is searched is
recorded.
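A minimal sketch of how the labeling, name-based search, and ranking modules
could fit together is given below. It assumes an in-memory store and exact,
case-insensitive name matching; the class `ImageStore` and its method names
are hypothetical and are not part of the described system. Content-based
retrieval is left out here because it needs a feature comparison step.

```python
from collections import Counter

class ImageStore:
    """Hypothetical in-memory stand-in for the labeled image database."""

    def __init__(self):
        self.images = []                # list of (label, image_id) pairs
        self.search_counts = Counter()  # label -> number of successful searches

    def upload(self, label, image_id):
        """Module 1: store an image together with its name label."""
        self.images.append((label.strip().lower(), image_id))

    def search_by_name(self, name):
        """Module 3: return ids of images whose label matches, else None."""
        key = name.strip().lower()
        matches = [image_id for label, image_id in self.images if label == key]
        if not matches:
            return None                 # the "null" output of the description
        self.search_counts[key] += 1    # Module 4: record the search
        return matches
```

Returning `None` mirrors the null output described for a search without a
match, and only successful searches are counted in this sketch.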

 

C. Architecture

                                                                    
Figure 1

Figure 1 illustrates the system flow of
the proposed framework of search-based face annotation, which consists of the
following steps:

1. Collection of images, labeling, and storing

2. Detection and feature extraction based on the input

3. Indexing and collection of the labeled data using the ULR technique

4. Face annotation, where similar images are retrieved using cluster analysis

5. Face annotation by a ranking scheme using association analysis

The first three steps are usually conducted before the test phase of an image
identification task, while the last two steps are conducted during the test
phase and therefore should be performed very efficiently. We briefly describe
each step below.

The first step is the collection of facial images, as shown in Figure 1; we
collect the images using the Google search engine. Given the nature of web
images, the collected images may be noisy: they do not always correspond to
the correct name, and such images are weakly or poorly labeled facial images.
The second step is to detect faces and extract image features; we use the
unsupervised face alignment technique proposed in [4]. For facial feature
representation, we extract GIST texture features [5] to represent the
extracted faces. The third step is to index the extracted features by applying
an efficient high-dimensional indexing technique; for this we use locality
sensitive hashing (LSH) [6]. An unsupervised learning scheme is then used to
enhance the label quality of the weakly labeled facial images, which is
important in the search process.
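To illustrate the indexing step, the following is a minimal sketch of one
common LSH family, random hyperplane hashing, written in plain Python. The
text does not state which LSH family is used, so the hash family, the class
name `HyperplaneLSH`, and the parameters `num_bits` and `seed` are
assumptions made for illustration.

```python
import random
from collections import defaultdict

class HyperplaneLSH:
    """Random-hyperplane LSH index for real-valued feature vectors."""

    def __init__(self, dim, num_bits=8, seed=0):
        rng = random.Random(seed)
        # Each random hyperplane contributes one bit of the hash key.
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                       for _ in range(num_bits)]
        self.buckets = defaultdict(list)

    def _key(self, vector):
        """Bit string recording on which side of each hyperplane the vector lies."""
        bits = []
        for plane in self.planes:
            dot = sum(p * v for p, v in zip(plane, vector))
            bits.append("1" if dot >= 0 else "0")
        return "".join(bits)

    def add(self, vector, label):
        """Store a feature vector and its label in the matching bucket."""
        self.buckets[self._key(vector)].append((vector, label))

    def candidates(self, vector):
        """Items that share the query's bucket (possibly an empty list)."""
        return list(self.buckets.get(self._key(vector), []))
```

Vectors that point in similar directions tend to fall into the same bucket,
so a query only needs to be compared against the candidates of its own bucket
rather than against the whole database. Practical systems usually combine
several such tables to reduce missed neighbors.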

The first three steps are the phases involved in the updated ULR algorithm.

The fourth step is the grouping of related images (the K most similar images)
using the cluster approximation algorithm. The last step is face annotation by
counting user clicks on the images returned for a particular search; this is
done using association analysis.

 

III. ALGORITHMS

A.     Updated ULR algorithm

Input: Image

Output: Similar Images/ NULL

Begin

Collection of images, labeling, and storing

Detection and feature extraction based on the input

Indexing and collection of the labeled data using the ULR technique

End

           

B.     Cluster Based Approximation Algorithm:

The number of variables in the extracted image features is a * b, where a is
the number of facial images in the retrieval database and b is the number of
distinct names. In this paper the clustering strategy can be applied in two
different ways for image retrieval:

1. Clustering on the “image itself,” which separates the a facial images into
groups of similar images.

2. Clustering on the “image name,” which separates the b names into groups.

Then, based on the given input, the similar images of the corresponding
cluster are displayed.

 

In this paper the k-nearest neighbors (k-NN) technique is used to group the
images.

The k-nearest neighbors algorithm (k-NN) is used for classification and
regression. In both cases, the input consists of the k closest training
samples in the feature space. In k-NN classification, the output is a class
membership: an object is classified by a majority vote of its neighbors and
is assigned to the class most common among its k nearest neighbors. If k = 1,
the object is simply assigned to the class of its single nearest neighbor. In
k-NN regression, the output is a property value for the object; this value is
the average of the values of its k nearest neighbors.
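The two uses of k-NN described above can be sketched in a few lines. This is a
minimal illustration that assumes Euclidean distance on small feature vectors
and a non-empty sample list; the function names are chosen here and the
distance measure is an assumption, since the text does not name one.

```python
import math
from collections import Counter

def k_nearest(query, samples, k):
    """Return the k samples closest to `query` by Euclidean distance.

    `samples` is a non-empty list of (feature_vector, value) pairs.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    return sorted(samples, key=lambda s: math.dist(query, s[0]))[:k]

def knn_classify(query, samples, k=3):
    """Classification: majority vote over the labels of the k nearest samples."""
    votes = Counter(label for _, label in k_nearest(query, samples, k))
    return votes.most_common(1)[0][0]

def knn_regress(query, samples, k=3):
    """Regression: average of the numeric values of the k nearest samples."""
    nearest = k_nearest(query, samples, k)
    return sum(value for _, value in nearest) / len(nearest)
```

With k = 1, `knn_classify` reduces to returning the label of the single
nearest sample, which matches the special case described in the text.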

 

C.     Association Based Approximation Algorithm:

Here, during image retrieval, a count is updated for every user click on an
image, and that count is reflected in the clusters for further use.
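One way to keep such counts and reflect them in the clusters is sketched
below. It assumes that each image belongs to exactly one cluster and that the
count is used to order the images within a cluster; the class `ClickRanker`
and its methods are hypothetical names used only for this illustration.

```python
from collections import Counter, defaultdict

class ClickRanker:
    """Tracks user clicks per image and ranks images inside each cluster."""

    def __init__(self):
        self.cluster_of = {}               # image_id -> cluster name
        self.members = defaultdict(list)   # cluster name -> image ids
        self.clicks = Counter()            # image_id -> click count

    def add_image(self, image_id, cluster):
        """Register an image as a member of a cluster."""
        self.cluster_of[image_id] = cluster
        self.members[cluster].append(image_id)

    def record_click(self, image_id):
        """Increase the count of an image each time a user clicks on it."""
        if image_id not in self.cluster_of:
            raise KeyError(f"unknown image: {image_id}")
        self.clicks[image_id] += 1

    def ranked(self, cluster):
        """Images of a cluster, most clicked first; ties keep insertion order."""
        return sorted(self.members.get(cluster, []),
                      key=lambda image_id: -self.clicks[image_id])
```

Because the sort is stable, images that have never been clicked stay in the
order in which they were added to the cluster.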

 

IV. CONCLUSION

In this paper, search-based and content-based image retrieval techniques are
used to mine the huge number of poorly labeled images that are freely
available on the web. The framework uses an updated ULR algorithm to identify
the images, a cluster approximation method to group similar images for
scalability, and an association analysis method to monitor the number of
times a particular image has been searched. Together, these methods improve
performance and scalability without degrading the system. As a future
enhancement, image retrieval could be based on time.