performance - Best solution to host a crawler?
I have a crawler that crawls a few different domains for new posts/content. The total amount of content is hundreds of thousands of pages, and there is a lot of new content added each day. To be able to get through all this content, I need the crawler to be crawling 24/7.
Currently I host the crawler script on the same server as the site the crawler is adding the content to, and I'm only able to run a cronjob during nighttime, because when I run it, the website stops working under the load of the script. In other words: a pretty crappy solution.
So I wonder, what is my best option for this kind of setup?
Is it possible to keep running the crawler on the same host, somehow balancing the load so the script doesn't kill the website?
What kind of host/server should I be looking for to host a crawler? Are there any specifications it needs beyond a normal web host?
The crawler saves the images it crawls. If I host the crawler on a secondary server, how do I save the images onto the server of my site? I guess I don't want to chmod 777 the uploads folder and allow anyone to put files on my server.
I decided to go with Amazon Web Services to host my crawler, since they have both SQS queues and auto-scalable instances. It also has S3 where I can store all my images.
I also decided to rewrite the whole crawler in Python instead of PHP, to take better advantage of things such as queues and to keep the app running 100% of the time, instead of using cronjobs.
So what I did, and what it means:
I set up an Elastic Beanstalk application with the crawler set to "Worker" and listening to an SQS queue where I store each domain that needs to be crawled. An SQS queue is basically a simple list where you can save each domain to be crawled; the crawler listens to the queue and fetches one domain at a time until the queue is done. There is no need for cronjobs or anything like that: as soon as the queue gets data into it, it is sent to the crawler. Meaning the crawler is up 100% of the time, 24/7. A minimal sketch of such a worker is shown below.
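This is a minimal sketch of that worker loop using the boto3 SDK and direct SQS long polling. Note that an actual Elastic Beanstalk worker environment delivers queue messages to your app via HTTP POST through its sqsd daemon; polling directly shows the same idea standalone. The queue URL and the `crawl_domain` function are placeholders, not my actual code:

```python
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/crawl-queue"

def crawl_domain(domain):
    # placeholder: fetch the domain's pages and extract new content
    print(f"crawling {domain}")

while True:
    # long polling: wait up to 20s for a message instead of busy-looping
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=1,
        WaitTimeSeconds=20,
    )
    for msg in resp.get("Messages", []):
        crawl_domain(msg["Body"])
        # delete only after a successful crawl, so failed messages
        # reappear in the queue and get retried
        sqs.delete_message(
            QueueUrl=QUEUE_URL,
            ReceiptHandle=msg["ReceiptHandle"],
        )
```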
The application is set to auto scale, meaning that when I have too many domains in the queue, it will spin up a second, third, fourth etc. instance/crawler to speed up the process. I think this is a very important point for anyone who wants to set up a crawler.
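For reference, here is a rough sketch of how that kind of queue-depth scaling can be wired up with boto3. In Elastic Beanstalk itself this is normally configured through the environment's auto scaling triggers rather than code, and every name and threshold below (group name, queue name, limits) is an assumption for illustration:

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# simple policy: add one crawler instance when the alarm fires
policy = autoscaling.put_scaling_policy(
    AutoScalingGroupName="crawler-workers",
    PolicyName="scale-out-on-backlog",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=1,
    Cooldown=300,  # wait 5 minutes before scaling again
)

# alarm on the queue backlog; a mirror-image alarm/policy
# (not shown) would scale back in when the queue drains
cloudwatch.put_metric_alarm(
    AlarmName="crawl-queue-backlog",
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "crawl-queue"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=1,
    Threshold=100,  # scale out when >100 domains are waiting
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[policy["PolicyARN"]],
)
```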
All images are saved to an S3 bucket. This means that the images are not saved on the server of the crawler, and can easily be fetched and worked with from anywhere.
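A minimal sketch of saving a crawled image straight to S3 with boto3; the bucket name and key scheme are assumptions:

```python
import boto3
import requests

s3 = boto3.client("s3")

def save_image(url, bucket="my-crawler-images"):
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    key = url.split("/")[-1]  # naive key: just the file name
    s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=resp.content,
        ContentType=resp.headers.get("Content-Type", "image/jpeg"),
    )
    return f"https://{bucket}.s3.amazonaws.com/{key}"
```

The site can then serve the images straight from S3 (or a CDN in front of it), so there is no need to chmod 777 an uploads folder or push files onto the web server at all.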
The results have been great. When I had the PHP crawler running on cronjobs every 15 minutes, I could crawl about 600 URLs per hour. Now I can without problems crawl 10,000+ URLs per hour, and even more depending on how I set up the auto scaling.