Design and Implementation of an Efficient Web Crawling

Abstract: A key factor in the success of the World Wide Web (WWW) is its large size and the lack of centralized control over its contents. Both characteristics are also the main source of problems in locating information. The web is a context in which traditional Information Retrieval (IR) methods are challenged, and given the volume of the web and its speed of change, the coverage of modern search engines is comparatively small. Moreover, the distribution of quality is highly skewed, and interesting pages are scarce in comparison with the rest of the content.
A web crawler is a program commonly used by search engines to discover new content on the internet. Crawlers have made the web easier for users to navigate. In this thesis, we collect data from web pages by imposing structure on unstructured data. Our system can select the words that appear near a given keyword across more than one document, without requiring structured input. Data neighboring the keyword were collected through word2vec. We introduce Word2vec-Tiling, a new technique for collecting data from web pages that applies word2vec within the tiling method. Word2vec is regarded as one of the simplest and most popular algorithms for identifying relationship weights among documents, so the user obtains accurate information related to the keyword. Word2vec-Tiling uses specific keywords to identify word frequencies, semantic relationships, and directional text ranks, and integrates them naturally. Finally, we propose a web search crawling algorithm derived from word2vec, Reinforcement Learning (RL), and Term Frequency-Inverse Document Frequency (TF-IDF) to improve the efficiency of searching for relevant information. The neural network serves as an advanced mechanism for verifying the semantic relationship between words and texts in a particular document. Our approach uses Word2vec-Tiling to capture the semantic features of words in the selected unstructured text while naturally integrating word frequency and semantic relations.
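As an illustration of the TF-IDF weighting named in this paragraph, the following is a minimal sketch of the standard textbook formulation, not the improved equation developed in the thesis. The function name `tf_idf` and the sample documents are assumptions introduced for this example only.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumption: plain TF (relative frequency) times IDF (log of inverse
    document frequency); the thesis's modified equation is not shown here.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical sample documents, already tokenized.
docs = [
    "web crawler downloads web pages".split(),
    "search engine ranks pages".split(),
    "crawler follows links".split(),
]
scores = tf_idf(docs)
```

In this sketch a term that is frequent in one document but rare across the collection (such as "web" in the first document) receives a higher weight than a term shared by several documents (such as "pages").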
Word2vec-Tiling is a technique for subdividing texts into multi-paragraph units that represent passages, or subtopics. The discourse cues for identifying major subtopic shifts are patterns of semantic co-occurrence and distribution. The algorithm is fully implemented and is shown to produce segmentation that corresponds well to Artificial Intelligence (AI) judgments of the subtopic boundaries of multiple texts. Multi-paragraph subtopic segmentation should be useful for many text analysis tasks, including text semantic relation analysis and word embedding. With the proposed Word2vec-Tiling method, information related to a searched keyword can be readily found on the internet, which we demonstrate by improving the equations of the TF and RL algorithms.
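The idea of placing subtopic boundaries where co-occurrence patterns shift can be sketched as follows. This is a simplified, tiling-style illustration under stated assumptions: it compares adjacent text units with cosine similarity over raw word counts, whereas the thesis describes using word2vec embeddings. The names `cosine` and `segment`, the `threshold` value, and the sample units are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def segment(units, threshold=0.1):
    """Group consecutive token lists into segments.

    A new segment starts wherever the similarity between neighbouring
    units falls below `threshold` (an assumed, untuned value).
    """
    if not units:
        return []
    segments = [[units[0]]]
    for prev, cur in zip(units, units[1:]):
        if cosine(Counter(prev), Counter(cur)) < threshold:
            segments.append([cur])      # vocabulary shift: open a new segment
        else:
            segments[-1].append(cur)    # same subtopic: extend current segment
    return segments

# Hypothetical tokenized sentences covering two subtopics.
units = [
    "the crawler fetches pages".split(),
    "the crawler parses pages".split(),
    "embeddings map words to vectors".split(),
    "similar words have close vectors".split(),
]
segments = segment(units)
```

Swapping the count vectors for averaged word embeddings would bring the sketch closer to the approach the abstract describes, since units could then be judged similar even when they share no surface vocabulary.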

Author(s): AHMED MD TANVIR

Issued Date: 2019

Awarded Date: 2019. 8

Type: Dissertation

Publisher: Pukyong National University

URI: https://repository.pknu.ac.kr:8443/handle/2021.oak/23461
http://pknu.dcollection.net/common/orgView/200000221307

Affiliation: Pukyong National University Graduate School

Department: Graduate School, Department of Computer Engineering

Advisor: Mokdong Chung

Table Of Contents: 1. Introduction
1.1. Introduction
1.2. Research Purpose
1.3. Scope and organization of this thesis
2. Related Work
2.1. Web Crawler background
2.2. Web search and Web crawling
2.3. Crawler Architecture
2.4. Unstructured data
2.5. Word2vec
2.5.1. Continuous Bag-of-Words Model (CBOW)
2.5.2. Skip-gram model
2.5.2.1. Reinforcement Learning framework
2.5.2.2. Reinforcement Learning overview
2.5.2.3. An overview of TF-IDF
2.5.2.4. Mathematical Framework
3. Design of Web Crawler
3.1. Web Crawler architecture
3.2. Hierarchical Clustering
3.2.1. Semantic Analysis
3.2.2. Word Embedding
3.3. System Overview
3.3.1. Continuous Bag-of-Words (CBOW) Model
3.3.2. CBOW Model System Overview
3.3.3. Skip-gram Model
3.3.4. Skip-gram Model overview
3.3.5. Optimizing Word2vec-Tiling Computational Efficiency
3.4. Word2vec-Tiling (CBOW and Skip-gram Model)
3.4.1. Reinforcement Learning (RL) Algorithm
3.4.2. TF-IDF Algorithm
3.4.3. TF-IDF Working Diagram
3.4.4. Development Tools
4. Implementation and Evaluation
4.1. Implementation
4.1.1. Implementation of CBOW Model
4.1.2. Implementation of Skip-gram Model
4.1.3. Implementation of Word2vec-Tiling
4.2. Implementation of TF-IDF Algorithm
4.2. Implementation of Reinforcement Learning Algorithm
4.3. Evaluation
4.3.1. Experimental Evaluation
4.3.2. TF-IDF Algorithm Experimental Evaluation
4.3.3. Reinforcement Learning Algorithm Experimental Evaluation
4.4. Data Set
4.5. Performance Measurement
5. Conclusion
5.1. Conclusion
5.2. Future Work

Degree: Master

Appears in Collections: Graduate School > Department of Computer Engineering

Authorize & License

Authorize: Public
