performance - Best solution to host a crawler?
I have a crawler that crawls a few different domains for new posts/content. The total amount of content is hundreds of thousands of pages, and there is a lot of new content added each day. To be able to get through all this content, I need my crawler to be crawling 24/7.
Currently I host the crawler script on the same server as the site the crawler is adding the content to, and I'm only able to run a cronjob that runs the script during nighttime, because when I do, the website basically stops working because of the load of the script. In other words, a pretty crappy solution.
So I wonder what my best option is for this kind of solution?
- Is it possible to keep running the crawler from the same host, but somehow balance the load so that the script doesn't kill the website?
- What kind of host/server should I be looking for to host a crawler? Are there any other specifications I need beyond those of a normal web host?
- The crawler saves the images it crawls. If I host my crawler on a secondary server, how do I save the images on the server of my site? I guess I don't want to chmod 777 my uploads folder and allow anyone to put files on my server.
I decided to choose Amazon Web Services to host my crawler, since they have both SQS for queues and auto-scalable instances. They also have S3, where I can store all my images.
I also decided to rewrite my whole crawler in Python instead of PHP, to more easily take advantage of things such as queues and to keep the app running 100% of the time, instead of using cronjobs.
So this is what I did, and what it means:
- I set up an Elastic Beanstalk application for my crawler that is set to "worker" and listens to an SQS queue where I store all the domains that need to be crawled. SQS is a "queue" where I can save each domain that needs to be crawled, and the crawler listens to the queue and fetches one domain at a time until the queue is done. There is no need for "cronjobs" or anything like that; as soon as the queue gets data into it, it sends it to the crawler. This means the crawler is up 100% of the time, 24/7.
- The application is set to auto scaling, meaning that when I have too many domains in the queue, it sets up a second, third, fourth, etc. instance/crawler to speed up the process. I think this is a very important point for anyone who wants to set up a crawler.
- All images are saved in an S3 bucket. This means the images are not saved on the crawler's server and can easily be fetched and worked with.
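The setup described in the list above can be sketched roughly as follows. This is a minimal illustration, not my actual crawler code: `poll_once`, `run_forever`, `image_key`, the `crawl_domain` callback, the queue URL and the bucket name are all hypothetical names made up for the example. It also polls SQS directly with boto3 to keep things short, whereas in an Elastic Beanstalk worker environment a daemon reads the queue and delivers each message to the app as an HTTP POST request.

```python
import hashlib
from urllib.parse import urlparse


def image_key(domain, image_url):
    """Build a stable S3 object key for an image (hypothetical key layout)."""
    digest = hashlib.sha256(image_url.encode("utf-8")).hexdigest()[:16]
    name = urlparse(image_url).path.rsplit("/", 1)[-1] or "image"
    return f"{domain}/{digest}-{name}"


def poll_once(sqs, s3, queue_url, bucket, crawl_domain):
    """Take at most one domain off the queue, crawl it, store its images in S3.

    `crawl_domain(domain)` is an assumed callback that yields
    (image_url, image_bytes) pairs. Returns the processed domain,
    or None if no message arrived.
    """
    resp = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=1,  # one domain at a time
        WaitTimeSeconds=20,     # long polling instead of a cron schedule
    )
    messages = resp.get("Messages", [])
    if not messages:
        return None

    msg = messages[0]
    domain = msg["Body"].strip()
    for image_url, data in crawl_domain(domain):
        s3.put_object(Bucket=bucket, Key=image_key(domain, image_url), Body=data)

    # Delete only after the crawl succeeded, so a crashed worker's
    # message becomes visible again and is retried by another instance.
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
    return domain


def run_forever(queue_url, bucket, crawl_domain):
    """Worker loop; needs boto3 installed and AWS credentials configured."""
    import boto3  # imported here so the functions above work without it

    sqs = boto3.client("sqs")
    s3 = boto3.client("s3")
    while True:
        poll_once(sqs, s3, queue_url, bucket, crawl_domain)
```

Because every instance runs the same loop against the same queue, adding a second or third instance needs no coordination code: each one simply picks up the next available domain.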
The results have been great. When I had a PHP crawler running on cronjobs every 15 minutes, I could crawl about 600 URLs per hour. Now I can without problems crawl 10,000+ URLs per hour, and even more depending on how I set my auto scaling.
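To put those numbers in perspective, here is a small back-of-the-envelope calculation. The two rates are the ones quoted above; the backlog of 300,000 pages is only an assumed stand-in for "hundreds of thousands", and `hours_to_crawl` is a helper name made up for the example.

```python
def hours_to_crawl(total_urls, urls_per_hour):
    """Hours needed to get through a backlog at a constant crawl rate."""
    return total_urls / urls_per_hour


OLD_RATE = 600       # URLs/hour with the PHP crawler on cron (from the post)
NEW_RATE = 10_000    # URLs/hour with the queue-based workers (from the post)
BACKLOG = 300_000    # assumption: stand-in for "hundreds of thousands of pages"

old_hours = hours_to_crawl(BACKLOG, OLD_RATE)
new_hours = hours_to_crawl(BACKLOG, NEW_RATE)
speedup = NEW_RATE / OLD_RATE
urls_per_cron_run = OLD_RATE / 4  # a 15-minute cron fires four times per hour

print(f"old: {old_hours:.0f} h, new: {new_hours:.0f} h, speedup: {speedup:.1f}x")
```

Under that assumed backlog, a full pass drops from roughly three weeks of continuous crawling to a little over a day.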