Collect All Cloudflare Logs Cost-Effectively

Collect all your Cloudflare logs to improve IT and Security operations

A content delivery network (CDN) is a geographically distributed group of servers that work together to deliver web content quickly and reliably. When configured correctly, a CDN can also improve security by providing DDoS mitigation. Cloudflare CDN is used by organizations across industries to improve website load times, increase content availability, and reduce bandwidth costs.

Why forward Cloudflare logs to EraSearch?

Wouldn’t it be great to see which traffic the CDN is blocking, watch for errors at your origin server, or track down speed issues from your customers’ browsers all the way through to the origin? Cloudflare logs provide detailed information about incoming requests to your applications. These logs are critical for optimizing user experience and give you the context around problems that enables faster troubleshooting.

EraSearch is the perfect observability and analytics platform for ingesting, indexing, storing, and querying all your Cloudflare logs. With a multi-tier architecture for scaling high-volume logs, EraSearch enables modern IT teams to cost-effectively combine request logs with other data sources, such as application server logs, for end-to-end visibility. This gives you more, and faster, ways to identify and debug infrastructure and operational issues. And without the operational complexity and cost of running your own log management infrastructure, you can finally focus on leveraging data - no matter the volume, variety, and velocity of your log files - to improve the user experience.

Because EraSearch supports the Elasticsearch API, our customers can use the tools they already know and love, such as Vector. The difference is that we eliminate the complexity and cost of storing all this data in a hot Elasticsearch cluster. A major advantage of EraSearch is that users don’t need to define schemas ahead of time. And for Splunk users, you can now cost-effectively collect all your Cloudflare and other logs to improve IT and security operations.

Collecting Cloudflare logs using Vector

Cloudflare provides two ways to get logs to your destination: Logpull and Logpush. Logpull requires you to write custom scripts against its API to repeatedly download logs. Logpush, on the other hand, is more efficient: you simply tell Cloudflare where to send your logs, and it pushes them there. Cloudflare Logpush supports pushing logs to cloud service providers and other services, such as the Splunk HTTP Event Collector (HEC).
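
For comparison, pulling logs yourself with Logpull looks roughly like this (a sketch: the time window and fields are illustrative, and the token needs permission to read logs):

# Download request logs for a one-minute window via the Logpull API.
curl -s -H "Authorization: Bearer ${CF_API_TOKEN}" \
"https://api.cloudflare.com/client/v4/zones/<ZONE_ID>/logs/received?start=2022-06-01T10:00:00Z&end=2022-06-01T10:01:00Z&fields=RayID,EdgeStartTimestamp,ClientRequestHost"

Logpush removes the need to schedule and babysit requests like this, which is why we’ll use it here.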

Now, we’ll show you how to:

  •  Enable Cloudflare Logpush to push logs to Vector in the Splunk HEC format

  •  Use the Splunk HEC provider in Vector to accept Cloudflare logs 

  •  Use the Elasticsearch-compatible format from Vector to send logs to EraSearch

Cloudflare setup

To enable Cloudflare Logpush to Vector in the Splunk HEC format, you first need to create an API token, as described in Cloudflare’s API token documentation.
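
Once the token is created, it’s worth verifying it before wiring anything else up; Cloudflare provides a token verification endpoint for exactly this:

# Confirm the API token is valid and active before using it for Logpush.
curl -s -H "Authorization: Bearer ${CF_API_TOKEN}" \
"https://api.cloudflare.com/client/v4/user/tokens/verify"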

After the API token is created, run the curl command below once for each Cloudflare zone, changing the ZONE_ID value each time. The API request to Cloudflare needs the following variable values to be passed in the command:

  • ZONE_ID of the domain from which you want to send logs. You will need to send a curl command for each zone. The ZONE_ID can be gathered from the bottom right of your Cloudflare domain homepage, or looked up via the API, as shown after this list.

  • HOSTNAME is the name of the Cloudflare configuration.

  • EMAIL of the Cloudflare account (not needed when authenticating with an API token, as in the command below).

  • CF_API_TOKEN is the Cloudflare API token created above.

  • VECTOR_ENDPOINT is the IP or hostname at which Vector will receive the Splunk HEC formatted Cloudflare logs.

  • VECTOR_AUTH_TOKEN is a string token used by Vector for authenticating requests from Cloudflare. This will be needed by Vector and can be set to any string.

  • CHANNEL_ID is a unique identifier for the channel on which this data will arrive. You need a unique CHANNEL_ID for each Cloudflare hostname so that you receive only the data relevant to that hostname.
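
As noted in the ZONE_ID item above, you can also look up a zone’s ID via the API instead of the dashboard (a sketch: example.com is a placeholder, and jq is assumed to be installed):

# Find the ZONE_ID for a domain by name.
curl -s -H "Authorization: Bearer ${CF_API_TOKEN}" \
"https://api.cloudflare.com/client/v4/zones?name=example.com" | jq -r '.result[0].id'

With those values in hand, create the Logpush job: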

curl -s -H "Authorization: Bearer ${CF_API_TOKEN}" \
"https://api.cloudflare.com/client/v4/zones/<ZONE_ID>/logpush/jobs" -X POST \
-d {"name":"<HOSTNAME>","destination_conf": "splunk://<VECTOR_ENDPOINT>:8088/services/collector/raw?channel=<CHANNEL_ID>&insecure-skip-verify=true&sourcetype=cloudflare:json&header_Authorization=<VECTOR_AUTH_TOKEN>","logpull_options": "fields=RayID,EdgeStartTimestamp,CacheCacheStatus,CacheTieredFill,ClientASN,ClientCountry,ClientDeviceType,ClientIPClass,ClientMTLSAuthCertFingerprint,ClientMTLSAuthStatus,ClientSSLCipher,ClientSSLProtocol,ClientSrcPort,ClientTCPRTTMs,ClientXRequestedWith,ClientRequestBytes,ClientRequestHost,ClientRequestMethod,ClientRequestPath,ClientRequestProtocol,ClientRequestReferer,ClientRequestScheme,ClientRequestSource,ClientRequestURI,ClientRequestUserAgent,EdgeCFConnectingO2O,EdgeColoCode,EdgeColoID,EdgeEndTimestamp,EdgePathingOp,EdgePathingSrc,EdgePathingStatus,EdgeRateLimitAction,EdgeRateLimitID,EdgeRequestHost,EdgeResponseBodyBytes,EdgeResponseBytes,EdgeResponseCompressionRatio,EdgeResponseContentType,EdgeResponseStatus,EdgeServerIP,EdgeTimeToFirstByteMs,FirewallMatchesActions,FirewallMatchesRuleIDs,FirewallMatchesSources,OriginDNSResponseTimeMs,OriginIP,OriginRequestHeaderSendDurationMs,OriginSSLProtocol,OriginTCPHandshakeDurationMs,OriginTLSHandshakeDurationMs,OriginResponseBytes,OriginResponseDurationMs,OriginResponseHTTPExpires,OriginResponseHTTPLastModified,OriginResponseTime,OriginResponseStatus,OriginResponseHeaderReceiveDurationMs,WAFAction,WAFFlags,WAFMatchedVar,WAFProfile,WAFRuleID,WAFRuleMessage,WorkerSubrequestCount,WorkerSubrequest,WorkerStatus,WorkerCPUTime\u0026timestamps=rfc3339","dataset": "http_requests"}'

After successfully issuing the request above, the Logpush configuration will be active and ready to push logs to Vector. To confirm the job was created, you can list the zone’s Logpush jobs with the same token:
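
# List the zone's Logpush jobs to confirm the new job exists and is enabled.
curl -s -H "Authorization: Bearer ${CF_API_TOKEN}" \
"https://api.cloudflare.com/client/v4/zones/<ZONE_ID>/logpush/jobs"

Now, let’s configure Vector to receive requests.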

Using Vector

Vector is a lightweight and efficient log collector that serves as a great frontend to EraSearch. After setting up Cloudflare to push logs to Splunk as described above, the next step is to configure Vector to accept the Cloudflare logs in Splunk HEC format. This configuration will also be used to forward the logs from Vector to EraSearch. 

You can install Vector on a number of Linux distributions, as well as macOS, Windows, and Raspbian. The installation creates a vector.toml configuration file (located in /etc/vector by default). In this configuration file, you specify the key components of a Vector observability pipeline:

  • Sources define where Vector should pull data from, or in this case, from where Vector should receive data pushed to it. 

  • Transforms shape your data as it’s transported by Vector, via parsing, filtering, sampling or aggregating. You can have a number of transforms in your pipeline. 

  • Sinks create a destination for log forwarding.
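
If Vector is not installed yet, one common option is the official installer script (a sketch; platform package managers work just as well):

# Install Vector via the official installation script.
curl --proto '=https' --tlsv1.2 -sSfL https://sh.vector.dev | bash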

In this scenario, we’ll configure Splunk HEC as a Vector source, transform the data, and set up EraSearch as a Vector sink via an Elasticsearch-compatible API. If you’re running a self-hosted EraSearch environment, add the following configuration in vector.toml to forward Splunk-formatted Cloudflare logs to EraSearch. 

# Configure Splunk HEC as a Vector source for Cloudflare logs.
# ${VECPORT} must match the port in the Logpush destination_conf above (8088 in this example).
[sources.cloudflare]
  type = "splunk_hec"
  address = "0.0.0.0:${VECPORT}"
  tls.enabled = true
  tls.crt_file = "vector.eradb.com.crt"
  tls.key_file = "vector.eradb.com.key"

# The filter transform forwards only logs whose .splunk_channel matches one of your channel IDs.
[transforms.splunk_cf]
  type = "filter"
  inputs = ["cloudflare"]
  condition = 'includes(["${channelID}","${channelID2}","${channelID3}"], .splunk_channel)'

# This transform splits a single Splunk input with newline-delimited messages into multiple logs.
[transforms.explode_cf]
  type = "remap"
  inputs = ["splunk_cf"]
  source = '''
  # Trim whitespace; fall back to the original value if .message is not a string.
  .message = strip_whitespace(.message) ?? .message
  # Assigning an array to the event root makes remap emit one event per element.
  if match(.message, r'\n') ?? false {
    . = split!(.message, r'\n')
  }
  '''

# This transform parses the log message into JSON format.
[transforms.parse_logs_cf]
  type = "remap"
  inputs = ["explode_cf"]
  source = '''
  . = parse_json!(.message)
  '''

# Configure EraSearch as a Vector sink.
[sinks.es2_cloudflare]
  type = "elasticsearch"
  inputs = ["parse_logs_cf"]
  endpoint = "https://${erasearch_host}:443"
  index = "logs-cf"
  auth.user = "${erasearch_username}"
  auth.password = "${erasearch_password}"
  auth.strategy = "basic"
  healthcheck.enabled = false
  batch.max_bytes = 5000000
  request.concurrency = "adaptive"
  request.headers.Content-Type = "application/json"
  buffer.max_events = 5000
  buffer.type = "memory"
  buffer.when_full = "block"

While not required, the additional configuration below exposes Vector’s own metrics in a Prometheus-compatible format, providing insight into how Vector itself is performing. Cloudflare can push a lot of data, so keeping an eye on Vector’s health helps ensure your data actually reaches EraSearch.

[sources.internal]
  type = "internal_metrics"

[sinks.prometheus]
  type = "prometheus_exporter"
  inputs = ["internal"]
  address = "0.0.0.0:9598"

Finally, validate the configuration and restart the Vector service. On a systemd-based host, that looks roughly like this:
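
# Check the configuration for TOML and VRL errors before restarting.
vector validate /etc/vector/vector.toml

# Restart Vector so the new pipeline takes effect.
sudo systemctl restart vector

If everything is configured correctly and data is flowing, you should now see your logs in EraSearch.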

If you’re a Cloudflare customer and are looking for a better, more cost-effective way to gain faster insights from your log data, contact us to learn more about EraSearch.
