
1. Why Splunk Architecture Matters in SIEM and DevSecOps
Splunk is often described as a Data-to-Everything platform, but in SIEM and DevSecOps environments, its real value comes from how it processes data at scale.
Handling hundreds of gigabytes, or even terabytes, of logs per day is not just about collecting data. It requires:
- Predictable ingestion performance
- Fast and repeatable search execution
- Clear separation between ingestion, processing, and analysis
- Operational flexibility when detection logic changes
Splunk's architecture is designed specifically to meet these requirements, and every major component exists to enforce this separation.
2. Architectural Principle : Separation of Responsibility

At a high level, Splunk enforces three distinct responsibilities:
| Responsibility | Component |
| --- | --- |
| Data collection | Universal Forwarder (UF) |
| Data processing & indexing | Indexer |
| Search & analytics | Search Head (SH) |
This separation is not optional; it is the core reason Splunk scales reliably.
If collection, parsing, indexing, and searching happened in the same layer :
- Ingestion spikes would slow analyst searches.
- Complex searches would impact data collection.
- Scaling decisions would become unpredictable.
Splunk avoids this by isolating workloads by role.
3. Universal Forwarder (UF) : Why "Less" Is More
The Universal Forwarder is a lightweight agent responsible only for data ingestion.
UF Responsibilities
- Collects logs from files, network ports, Windows Event Logs, etc. (based on inputs.conf)
- Performs only minimal processing, such as basic line breaking
- Forwards raw data to Indexers via TCP (configured via outputs.conf)
- Uses very little CPU and memory, making it suitable for deployment on thousands of servers
What UF Does Not Do (and Why)
UF does not:
- Extract fields
- Normalize timestamps
- Apply parsing rules
This is a deliberate design choice.
If UF performed full parsing:
- Parsing logic would need to be maintained on thousands of endpoints.
- Endpoint resource usage would increase.
- Ingestion performance would vary across hosts.
By keeping UF "dumb," Splunk ensures:
- Consistent ingestion behavior
- Centralized parsing control
- Easy large-scale deployment
This is why the parsing rules in props.conf and transforms.conf are not meant to live on the UF; they belong on the Indexers, where parsing actually happens.
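As a rough illustration, the entire UF-side footprint is usually just two small files in a deployment app (the app name and path below are illustrative):

$SPLUNK_HOME/etc/apps/uf_base_inputs/local/
    inputs.conf      # what to collect and how to label it
    outputs.conf     # where to forward it
    # no props.conf / transforms.conf here; parsing stays centralized on the Indexers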
4. Reliable Data Transfer : UF → Indexer
The UF forwards data to Indexers over TCP, and indexer acknowledgement (the useACK setting in outputs.conf) can be enabled to guarantee delivery.
This choice prioritizes reliability over raw throughput:
- Data is not dropped during network instability.
- Indexers can apply backpressure when overloaded.
- Ingestion remains consistent under high load.
In SIEM environments, delayed logs are acceptable. Lost logs are not.
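A minimal outputs.conf sketch of this behavior, reusing the Indexer addresses from section 7-2 (note that useACK is disabled by default and must be enabled explicitly):

[tcpout:indexer_group]
server = 10.0.1.10:9997,10.0.1.11:9997
useACK = true
# The forwarder keeps events in its output queue until the Indexer acknowledges
# they have been processed, so data survives network drops and Indexer restarts.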
5. Indexer : The Real Engine of Splunk
The Indexer is the most critical and resource-intensive component in Splunk.
Indexer Responsibilities
- Receives raw data from Forwarders
- Parses incoming data according to rules defined in props.conf and transforms.conf (timestamp extraction, line breaking, index-time transforms)
- Indexes the data, storing rawdata plus time-series index metadata (tsidx)
- Serves search requests from Search Heads
Because it handles CPU-heavy parsing and disk-heavy indexing, the Indexer is almost always the first scaling bottleneck.
5-1. Parsing vs. Indexing : Why They Are Separate Stages
Parsing Stage
Parsing occurs before data is written to disk and includes:
- Timestamp extraction
- Line breaking (if not already done)
- Application of rules from props.conf and transforms.conf
Parsing defines how events are structured, not how they are searched.
Once parsing decisions are made, they are difficult to undo.
Indexing Stage
Indexing is responsible for:
- Writing raw events to disk.
- Creating tsidx (time-series index) files.
- Organizing data into searchable buckets.
At this point, data becomes searchable, but not yet fully analyzed.
This separation explains why:
- Parsing errors are hard to recover from
- Index-time decisions must be conservative
- Not all fields should be extracted at index-time.
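The difference is visible directly in props.conf. In the sketch below (the drop_debug_events transform name is illustrative; extract_user_from_message is defined in section 7-4), the same transforms.conf mechanism behaves very differently depending on how it is referenced:

[linux_secure]
# Index-time: applied once in the parsing pipeline; the result is baked into the index
TRANSFORMS-drop_noise = drop_debug_events
# Search-time: evaluated on every search; can be changed without re-indexing anything
REPORT-extract_user = extract_user_from_message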
5-2. tsidx Files : Why Splunk Searches Are Fast
When data is indexed, Splunk generates tsidx files.
What tsidx Contains
- Event timestamps
- Index-time fields (index, sourcetype, host, source)
- Term metadata used for fast filtering
What tsidx Does Not Contain
- Full raw event text
- Search-time extracted fields
- Dynamic or computed fields
During a search :
- Splunk first consults tsidx to find candidate events
- Only then does it read rawdata if needed
This explains several important behaviors:
- Time-based searches (tsidx searches) are extremely fast.
- Reading rawdata is the most expensive operation
- Reducing rawdata reads is the key to performance tuning.
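A rough way to see this in practice: the first search below can be answered from tsidx alone, while the second must read rawdata to verify the phrase and extract the user field (the search-time field defined in section 7-4), so it is far more expensive over the same time range.

Fast, tsidx-only (tstats never touches rawdata):
| tstats count where index=security by sourcetype

Slower, reads rawdata for phrase matching and search-time field extraction:
index=security sourcetype=linux_secure "Failed password" | stats count by user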

6. Search Head (SH) : Execution Without Processing
The Search Head is not a data processing node.
SH Responsibilities
- Executes SPL (Search Processing Language)
- Coordinates distributed searches to Indexers
- Builds dashboards, alerts, and reports
- Manages search schedules and alert triggers
Search Heads send search logic to Indexers, and Indexers return results.
This explains why:
- Adding SHs improves concurrency.
- Adding SHs does not fix slow searches
- Search performance issues usually point to Indexers.
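As a simplified mental model, take the following search: the event filtering and the per-Indexer portion of stats run on the Indexers, and the Search Head only merges the partial results.

index=security sourcetype=linux_secure | stats count by host

This is why a slow search of this kind usually points at Indexer CPU or disk I/O rather than at the Search Head.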
7. Splunk Configuration Philosophy
Splunk is configured entirely through plain-text configuration files made up of stanzas and settings. Each file has a clear, non-overlapping responsibility, and understanding where and when each file is applied is critical for correct data onboarding and troubleshooting.
Splunk does not process configuration files arbitrarily. Each file participates in a specific stage of the ingestion pipeline.
7-1. inputs.conf - What data is collected
Role in the pipeline
- Stage : Data ingestion (on Forwarder or Indexer)
- Purpose : Define what data to collect and basic metadata.
"Which data should Splunk read, and how should it be labeled initially?"
[monitor:///var/log/secure]
sourcetype=linux_secure
index=security
What actually happens
- Splunk starts monitoring /var/log/secure
- Each new line is treated as raw input
- Metadata is attached :
- sourcetype = linux_secure
- index = security
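Other input types follow the same stanza pattern. As an illustrative sketch (the port number and index names are examples), a network port and a Windows Event Log channel look like this:

[udp://514]
sourcetype = syslog
index = network

[WinEventLog://Security]
index = windows
disabled = 0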
7-2. outputs.conf - Where the data goes
Role in the pipeline
- Stage : Data forwarding
- Purpose : Define how and where data is sent
"Where should collected data be forwarded?"
[tcpout:indexer_group]
server = 10.0.1.10:9997,10.0.1.11:9997
What actually happens
- Data collected via inputs.conf is sent to Indexers.
- Communication uses TCP with acknowledgement
- Indexers can apply backpressure if overloaded
- outputs.conf does not control parsing; it only controls transport
7-3. props.conf - What should be parsed
Role in the pipeline
- Stage : Parsing (on Indexer)
- Purpose : Define how raw data should be interpreted
"How should Splunk interpret this raw data?"
What it actually does
- Event line breaking
- Timestamp extraction
- Character encoding
- Declaring parsing rules
- Linking to transforms
[linux_secure]
TIME_PREFIX = ^
TIME_FORMAT = %b %d %H:%M:%S
MAX_TIMESTAMP_LOOKAHEAD = 15
REPORT-extract_user = extract_user_from_message
What actually happens
- When data with sourcetype=linux_secure arrives
- Splunk extracts timestamps based on these rules
- Correct timestamps are assigned before indexing.
- REPORT-extract_user tells Splunk to apply the transform named extract_user_from_message when events are searched
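For instance, a typical /var/log/secure line begins with a syslog-style timestamp. Under the rules above, TIME_FORMAT = %b %d %H:%M:%S matches "Dec 25 10:15:32", and MAX_TIMESTAMP_LOOKAHEAD = 15 keeps Splunk from scanning past those first 15 characters (the sample line is illustrative):

Dec 25 10:15:32 web01 sshd[2841]: Failed password for invalid user admin from 10.0.0.5 port 51122 ssh2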
7-4. transforms.conf - How data is modified
Role in the pipeline
- Stage : Index time or search time, depending on whether props.conf references the stanza via TRANSFORMS- or REPORT- (search time in this example)
- Purpose : Perform regex-based actions
"What exact action should Splunk perform on the data?"
[extract_user_from_message]
REGEX = user=(\w+)
FORMAT = user::$1
What actually happens
- Regex is applied to the raw event
- Captured value is extracted as a field
- The field becomes available at search time
- The extracted field is not indexed (it exists only when searching)
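Once the transform is in place, the field can be used like any other search-time field, for example:

index=security sourcetype=linux_secure user=* | stats count by user

Because the extraction happens at search time, changing the regex in transforms.conf immediately changes the result of this search, with no re-indexing required.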
7-5. Why props.conf and transforms.conf Must Be Separate
Splunk deliberately separates:
| File | Responsibility |
| --- | --- |
| props.conf | Declares intent |
| transforms.conf | Implements logic |
Operational benefits
- Easier debugging → "Is the transform called?" vs. "Is the regex wrong?"
- Safer change management → Reuse transforms without modifying parsing rules
- Clear ownership → Parsing logic vs. Execution logic
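The reuse benefit is concrete: several sourcetypes can reference the same transforms.conf stanza, so the regex is maintained in exactly one place (the second sourcetype name below is illustrative):

[linux_secure]
REPORT-extract_user = extract_user_from_message

[custom_app_auth]
REPORT-extract_user = extract_user_from_message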
8. Conclusion : Architecture First, Detection Second
In DevSecOps and SOC environments, successful Splunk usage follows a clear principle :
Architecture first, detection second.
By designing ingestion, parsing, and indexing correctly, Splunk becomes a robust foundation for real-time detection, observability, and scalable security analytics, allowing detection logic to evolve without compromising performance or reliability.