
Design and Implementation of an Efficient Web Crawling

Abstract
The success of the World Wide Web (WWW) rests largely on its size and on the absence of centralized control over its contents. These same two characteristics are also the main source of difficulty in locating information. The web challenges traditional Information Retrieval (IR) methods: given its volume and rate of change, the coverage of modern search engines is comparatively small. Moreover, the distribution of quality is highly skewed, and interesting pages are scarce relative to the rest of the content.
A web crawler is a program commonly used by search engines to discover new content on the internet, and crawlers have made the web easier for users to navigate. In this thesis, we collect data from web pages by imposing structure on unstructured data. Our system selects the words that occur near a given keyword across multiple unstructured documents, gathering these neighboring words with word2vec. We introduce Word2vec-Tiling, a new technique for collecting data from web pages that applies word2vec within the tiling method. Word2vec is regarded as one of the simplest and most popular algorithms for identifying the relationship weights among documents, so the user receives accurate information related to the keyword. Given specific keywords, Word2vec-Tiling identifies naturally fused word frequencies, semantic relationships, and directional text-ranks. Finally, we propose a web search crawling algorithm that combines word2vec with Reinforcement Learning (RL) and Term Frequency-Inverse Document Frequency (TF-IDF) web search algorithms to improve the efficiency of searching for relevant information. The neural network thus serves as an advanced mechanism for verifying the semantic relationship between words and texts in a particular document. Our approach uses Word2vec-Tiling to capture the semantic features of words in the selected unstructured text while naturally integrating word frequency and semantic relations.
Word2vec-Tiling is a technique for subdividing texts into multi-paragraph units that represent passages, or subtopics. The discourse cues used to identify major subtopic shifts are patterns of semantic co-occurrence and distribution. The algorithm is fully implemented and is shown to produce segmentations that correspond well to Artificial Intelligence (AI) judgments of the subtopic boundaries of multiple texts. Multi-paragraph subtopic segmentation should be useful for many text analysis tasks, including text semantic relation and word embedding. With the proposed Word2vec-Tiling method, information related to a searched keyword can be found readily on the internet, which we demonstrate by improving the equations of the TF and RL algorithms.
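The general idea of tiling a text into subtopic units can be illustrated with a minimal sketch. It is not the thesis's Word2vec-Tiling implementation: it assumes plain term-frequency vectors in place of word2vec embeddings, and the function names (`tokenize`, `gap_similarities`, `segment`) and the `threshold` value are hypothetical choices made only for illustration. A boundary is placed wherever the cosine similarity between two adjacent paragraphs falls below the threshold.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase the text and keep alphanumeric runs as tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    # Cosine similarity between two sparse term-frequency vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def gap_similarities(paragraphs):
    # One similarity score per gap between adjacent paragraphs.
    vectors = [Counter(tokenize(p)) for p in paragraphs]
    return [cosine(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]


def segment(paragraphs, threshold=0.1):
    # Start a new segment wherever adjacent paragraphs share too little
    # vocabulary (similarity below the assumed threshold).
    if not paragraphs:
        return []
    sims = gap_similarities(paragraphs)
    segments = [[paragraphs[0]]]
    for sim, para in zip(sims, paragraphs[1:]):
        if sim < threshold:
            segments.append([para])
        else:
            segments[-1].append(para)
    return segments


paragraphs = [
    "web crawler fetches pages",
    "the crawler follows web links",
    "neural network embeds words",
    "word embedding network training",
]
segments = segment(paragraphs)
```

In this toy input the second and third paragraphs share no tokens, so the sketch splits the text there. Replacing the term-frequency vectors with averaged word embeddings would let the comparison capture semantic rather than purely lexical overlap, which is closer in spirit to the approach the abstract describes.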
Author(s)
AHMED MD TANVIR
Issued Date
2019
Awarded Date
2019. 8
Type
Dissertation
Publisher
Pukyong National University
URI
https://repository.pknu.ac.kr:8443/handle/2021.oak/23461
http://pknu.dcollection.net/common/orgView/200000221307
Affiliation
Pukyong National University Graduate School
Department
Graduate School, Department of Computer Engineering
Advisor
Mokdong Chung
Table Of Contents
1. Introduction
1.1. Introduction
1.2. Research Purpose
1.3. Scope and organization of this thesis
2. Related Work
2.1. Web Crawler background
2.2. Web search and Web crawling
2.3. Crawler Architecture
2.4. Unstructured data
2.5. Word2vec
2.5.1. Continuous Bag-of-Words Model (CBOW)
2.5.2. Skip-gram model
2.5.2.1. Reinforcement Learning framework
2.5.2.2. Reinforcement Learning overview
2.5.2.3. An overview of TF-IDF
2.5.2.4. Mathematical Framework
3. Design of Web Crawler
3.1. Web Crawler architecture
3.2. Hierarchical Clustering
3.2.1. Semantic Analysis
3.2.2. Word Embedding
3.3. System Overview
3.3.1. Continuous Bag-of-Words (CBOW) Model
3.3.2. CBOW Model System Overview
3.3.3. Skip-gram Model
3.3.4. Skip-gram Model overview
3.3.5. Optimizing Word2vec-Tiling Computational Efficiency
3.4. Word2vec-Tiling (CBOW and Skip-gram Model)
3.4.1. Reinforcement Learning (RL) Algorithm
3.4.2. TF-IDF Algorithm
3.4.3. TF-IDF Working Diagram
3.4.4. Development Tools
4. Implementation and Evaluation
4.1. Implementation
4.1.1. Implementation of CBOW Model
4.1.2. Implementation of Skip-gram Model
4.1.3. Implementation of Word2vec-Tiling
4.2. Implementation of TF-IDF Algorithm
4.2. Implementation of Reinforcement Learning Algorithm
4.3. Evaluation
4.3.1. Experimental Evaluation
4.3.2. TF-IDF Algorithm Experimental Evaluation
4.3.3. Reinforcement Learning Algorithm Experimental Evaluation
4.4. Data Set
4.5. Performance Measurement
5. Conclusion
5.1. Conclusion
5.2. Future Work
Degree
Master
Appears in Collections:
Graduate School > Department of Computer Engineering
Authorize & License
  • Authorize: Public
Items in Repository are protected by copyright, with all rights reserved, unless otherwise indicated.