performance - Best solution to host a crawler?


I have a crawler that crawls a few different domains for new posts/content. The total amount of content is hundreds of thousands of pages, and a lot of new content is added each day. To get through all of this content, I need my crawler to be crawling 24/7.

Currently I host the crawler script on the same server as the site the crawler adds content to, and I'm only able to run a cronjob that runs the script during nighttime, because when I do, the website basically stops working under the load of the script. In other words, a pretty crappy solution.

So basically I wonder what my best option is for this kind of solution?

  • Is it possible to keep running the crawler from the same host, but somehow balance the load so that the script doesn't kill the website?

  • What kind of host/server should I be looking for to host a crawler? Are there any other specifications I need compared to a normal web host?

  • The crawler saves the images it crawls. If I host my crawler on a secondary server, how do I save my images on the server of my site? I guess I don't want chmod 777 on my uploads folder, allowing anyone to put files on my server.

I decided to choose Amazon Web Services to host my crawler, as they have both SQS for queues and auto-scalable instances. They also have S3, where I can store all my images.
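To illustrate how storing images in S3 avoids writing to the site's uploads folder, here is a minimal sketch. The bucket name, the `images/` key prefix and both function names are hypothetical and not taken from the original setup; the upload step assumes `boto3` is installed and AWS credentials are configured.

```python
import hashlib
import os
from urllib.parse import urlparse

BUCKET = "my-crawler-images"  # hypothetical bucket name


def image_key(image_url):
    """Build a stable S3 object key from the image URL.

    Hashing the full URL gives every image a unique, safe file name;
    the original file extension is kept when there is one.
    """
    parsed = urlparse(image_url)
    ext = os.path.splitext(parsed.path)[1].lower()
    digest = hashlib.sha256(image_url.encode("utf-8")).hexdigest()
    return f"images/{parsed.netloc.lower()}/{digest}{ext}"


def upload_image(image_url, data, content_type="image/jpeg"):
    """Upload the image bytes to S3 and return the object key."""
    import boto3  # imported here so image_key() works without boto3 installed

    key = image_key(image_url)
    boto3.client("s3").put_object(
        Bucket=BUCKET, Key=key, Body=data, ContentType=content_type
    )
    return key
```

The crawler only needs credentials that allow writing to the bucket, and the website can read the images from S3, so neither server has to open its filesystem to the other.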

I also decided to rewrite my whole crawler in Python instead of PHP, to more easily take advantage of things such as queues and to keep the app running 100% of the time, instead of using cronjobs.

So here is what I did, and what it means:

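  1. I set up an Elastic Beanstalk application for my crawler that is set to "worker" and listens to an SQS queue where I store all the domains that need to be crawled. SQS is a "queue" where I can save each domain that needs to be crawled, and the crawler listens to the queue and fetches one domain at a time until the queue is empty. There is no need for "cronjobs" or anything like that; as soon as the queue gets data in it, it is sent to the crawler. This means the crawler is up 100% of the time, 24/7.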

  2. The application is set to auto scaling, meaning that when I have too many domains in the queue, it sets up a second, third, fourth etc. instance/crawler to speed up the process. I think this is a very important point for anyone who wants to set up a crawler.

  3. All images are saved in an S3 bucket. This means the images are not saved on the crawler's server and can easily be fetched and worked with.
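The worker setup in the steps above can be sketched roughly as follows. In an Elastic Beanstalk worker environment, a daemon reads each message from the SQS queue and delivers it to the application as an HTTP POST on localhost; a 200 response marks the message as done, and any other status leaves it on the queue for a retry. This sketch assumes each message body is a single domain name. `crawl_domain`, the loose domain check and port 8080 are placeholders for illustration, not the original crawler.

```python
import re
from http.server import BaseHTTPRequestHandler, HTTPServer

# Loose sanity check for a domain name (hypothetical, not a full validator).
DOMAIN_RE = re.compile(r"^(?=.{1,253}$)([a-z0-9-]{1,63}\.)+[a-z]{2,63}$")


def crawl_domain(domain):
    """Placeholder for the real crawl logic."""
    return domain


def handle_message(body, crawl=crawl_domain):
    """Process one queue message and return the HTTP status to answer with."""
    domain = body.decode("utf-8", errors="replace").strip().lower()
    if not DOMAIN_RE.match(domain):
        # Acknowledge malformed messages so they are not retried forever.
        return 200
    try:
        crawl(domain)
    except Exception:
        # A non-200 status leaves the message on the queue for a retry.
        return 500
    return 200


class WorkerHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # The message body arrives as the body of the POST request.
        length = int(self.headers.get("Content-Length", 0))
        status = handle_message(self.rfile.read(length))
        self.send_response(status)
        self.end_headers()


def serve(port=8080):
    """Run the worker endpoint until the process is stopped."""
    HTTPServer(("127.0.0.1", port), WorkerHandler).serve_forever()
```

Calling `serve()` starts the endpoint. Because each instance simply answers POST requests, adding more instances through auto scaling needs no coordination between crawlers: the queue hands each domain to whichever instance is free.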

The results have been great. When I had a PHP crawler running on cronjobs every 15 minutes, I could crawl about 600 URLs per hour. Now I can crawl 10,000+ URLs per hour without problems, and even more depending on how I set my auto scaling.

