I'm by no means a Splunk expert (you should ask them), but I think it
scales pretty well. You can use multiple masters to receive and
load-balance logs, and you can distribute the searching map/reduce
style to leverage more cores. Search speed seems to be much more
CPU-bound than I/O-bound; the logs are packed pretty efficiently.
*Works for me* with ~15-20 EC2 instances and one central logging
server. It also keeps logs in tiered buckets, so things from 30 days
ago are available but slower to search, whereas yesterday's logs are
'hotter'.
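
If it helps, the load-balancing part is just a list of receivers in the
forwarder's outputs.conf; roughly like this (stanza names from memory,
hostnames made up, 9997 is the usual receiving port):

# outputs.conf on each forwarder: list more than one receiver and the
# forwarder will spread the load across them
[tcpout]
defaultGroup = central

[tcpout:central]
server = indexer1.example.com:9997, indexer2.example.com:9997

# inputs.conf on each receiver: listen for forwarded data
[splunktcp://9997]
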
On Thu, Apr 16, 2009 at 8:41 PM, Gabriel Ramuglia <gabe@vtunnel.com> wrote:
> Does this scale well? I'm running a web-based proxy that generates an
> absolute ton of log files. Easily 40GB/week/server, with around 20
> servers. I'm looking to be able to store and search up to 7 days of
> logs. Currently, I only move logs from the individual servers onto a
> central server when I get a complaint, import them into MySQL, and
> search them. The entire process, even for just one server, takes
> forever.
>
> On Thu, Apr 16, 2009 at 7:37 PM, W. Andrew Loe III <andrew@andrewloe.com> wrote:
>> It's commercial, but Splunk is amazing at this. I think you can process
>> a few hundred MB/day on the free version. http://splunk.com/
>>
>> You set up a lightweight forwarder on every node you are interested
>> in, and then it slurps the files up and relays them to a central
>> Splunk installation. It will queue internally if the master goes away.
>> There's tons of support for sending different files in different
>> directions, etc. We have it set up in the default Puppet payload so
>> every log on every server is always centralized and searchable.
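>>
>> On each node it's basically just a monitor stanza in inputs.conf telling
>> the forwarder what to pick up; roughly like this (the path, sourcetype
>> and index here are made up for the example):
>>
>> # inputs.conf on a node: watch a log directory and tag it
>> [monitor:///var/log/nginx/]
>> sourcetype = access_combined
>> index = main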
>>
>> On Wed, Apr 15, 2009 at 8:44 AM, Michael Shadle <mike503@gmail.com> wrote:
>>> On Wed, Apr 15, 2009 at 7:06 AM, Dave Cheney <dave@cheney.net> wrote:
>>>
>>>> What about
>>>>
>>>> cat *.log | sort -k 4
>>>
>>> or just
>>>
>>> cat *whatever.log >today.log
>>>
>>> I assume the processing script can handle out-of-order requests, but I
>>> guess that might be an arrogant assumption. :)
>>>
>>> I do basically the same thing Igor does, but would love to simplify it
>>> to just per-Host: header counts of bytes (sent/received/total bytes
>>> used, basically) and the number of HTTP requests. Logging just enough
>>> of that to a file and parsing it each night seems kinda amateur...
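>>>
>>> Something like a stripped-down log_format and a nightly awk pass would
>>> probably cover it; a rough, untested sketch (assumes nginx's $bytes_sent
>>> and $request_length variables, file name made up):
>>>
>>> # nginx: log only what we need
>>> log_format hostbytes '$host $bytes_sent $request_length';
>>> access_log /var/log/nginx/hostbytes.log hostbytes;
>>>
>>> # nightly: per-Host requests, bytes out, bytes in, total
>>> awk '{ req[$1]++; out[$1] += $2; inb[$1] += $3 }
>>>   END {
>>>     for (h in req) print h, req[h], out[h], inb[h], out[h] + inb[h]
>>>   }' /var/log/nginx/hostbytes.log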