Email Id Extractor Project from sites in Scrapy Python

def parsed(self, response):
    # Convert the set of unique scraped strings to a list
    emails = list(self.uniqueemail)
    finalemail = []

    for email in emails:
        # Drop garbage values: keep only strings that contain
        # '.in', '.com', 'info' or 'org', and append them to finalemail
        if ('.in' in email or '.com' in email or 'info' in email or 'org' in email):
            finalemail.append(email)

    # Final unique email ids from the geeksforgeeks site
    print('\n'*2)
    print("Emails scraped", finalemail)
    print('\n'*2)

Explanation of Parsed function: 
The regular expression used above also matches garbage values such as select@1.13 when scraping email ids from geeksforgeeks, and select@1.13 is clearly not an email id. The parsed function filters the results and keeps only the strings that contain '.com', '.in', 'info' or 'org'.
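The substring check is simple, but it is loose: a string such as info@1.13 would still pass because it contains 'info'. As a rough sketch of a stricter alternative (not the method used in this project), each candidate could be matched against a full pattern that requires an alphabetic top-level domain. The names EMAIL_PATTERN and filter_emails, and the example.com / example.org addresses, are made up for illustration.

```python
import re

# Hypothetical pattern (an assumption, not part of the original spider):
# local part, "@", one or more domain labels, then an alphabetic
# top-level domain of at least two letters.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}")


def filter_emails(candidates):
    """Return the unique candidates that fully match EMAIL_PATTERN, sorted."""
    return sorted(c for c in set(candidates) if EMAIL_PATTERN.fullmatch(c))


# Sample input: select@1.13 is the garbage value mentioned above,
# the other two addresses are invented for this example.
candidates = {"select@1.13", "info@example.com", "feedback@example.org"}
print(filter_emails(candidates))
```

Inside the spider, a helper like this could replace the if condition in parsed, for example finalemail = filter_emails(self.uniqueemail). The pattern is deliberately simple and will not cover every valid address, so treat it as a starting point.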
 

Run the spider using the following command:

scrapy crawl spidername (spidername is the name of the spider)

Garbage value in scraped emails: 


Final scraped emails: 
 


 
