File and Output Parsing
Related topics: Data Model and Flow, Extending Harbinger Playbook and Output Templating
Relevant source files
- harbinger/src/harbinger/worker/activities.py
- harbinger/src/harbinger/worker/files/parsers.py
- harbinger/src/harbinger/worker/output.py
- harbinger/interface/src/models.ts
- harbinger/src/harbinger/proto/v1/messages_pb2.py
- go/proto/v1/messages.pb.go
- harbinger/src/harbinger/database/schemas.py
- harbinger/src/harbinger/files/router.py
- harbinger/src/harbinger/worker/files/utils.py
File and output parsing in Harbinger is a critical component responsible for processing raw data obtained from C2 implants, proxy jobs, and file uploads. This system extracts structured information such as processes, file lists, and specific indicators, which are then used to enrich the project's database and graph models. The parsing capabilities allow for automated analysis, timeline generation, and identification of key findings, contributing to situational awareness and detection risk assessments.
The parsing framework is designed for extensibility: abstract base classes for both output and file parsing allow new parsers to be added for additional data formats and command outputs. The entire process is integrated into the system's workflow engine, ensuring that data is processed efficiently and consistently.
Output parsing focuses on analyzing text-based responses from C2 tasks to extract relevant entities. This is managed through the OutputParser abstract base class.
Sources: harbinger/src/harbinger/worker/output.py:27-44
The OutputParser class defines the interface for all text output parsers. Each concrete parser must implement match and parse methods.
Sources: harbinger/src/harbinger/worker/output.py:27-44
```python
# harbinger/src/harbinger/worker/output.py
class OutputParser(abc.ABC):
    needle: list[str] = []
    labels: list[str] = []

    def __init__(self, db: AsyncSession) -> None:
        self.db = db

    @abc.abstractmethod
    async def match(self, text: str) -> bool:
        raise NotImplementedError("This should be implemented")

    @abc.abstractmethod
    async def parse(
        self,
        text: str,
        c2_implant_id: str | UUID4 | None = None,
        c2_output_id: str | UUID4 | None = None,
        file_id: str | UUID4 | None = None,
    ) -> None:
        raise NotImplementedError("This should be implemented")
```
Sources: harbinger/src/harbinger/worker/output.py:27-44
The match method determines if a given text output is relevant to the parser, while the parse method performs the actual extraction and storage of information. The needle attribute can be used to quickly identify relevant text.
Sources: harbinger/src/harbinger/worker/output.py:31-35
A conceptual class diagram for OutputParser:
```mermaid
classDiagram
    direction TB
    ABC <|-- OutputParser
    OutputParser : +needle: list~str~
    OutputParser : +labels: list~str~
    OutputParser : +__init__(db: AsyncSession)
    OutputParser : +match(text: str) bool
    OutputParser : +parse(text: str, c2_implant_id, c2_output_id, file_id) None
```
Sources: harbinger/src/harbinger/worker/output.py:27-44
An example of a specific output parser is parse_ccache, which handles Kerberos credential cache files.
Sources: harbinger/src/harbinger/worker/output.py:21, harbinger/src/harbinger/worker/activities.py:59
File parsing handles the analysis of uploaded or downloaded files to extract metadata, content, and structured information. This is facilitated by the BaseFileParser abstract base class.
Sources: harbinger/src/harbinger/worker/files/parsers.py:26-30
The BaseFileParser defines the contract for parsers that process entire files. It includes an abstract parse method and a base_parse utility method.
Sources: harbinger/src/harbinger/worker/files/parsers.py:26-44
```python
# harbinger/src/harbinger/worker/files/parsers.py
class BaseFileParser(ABC):
    """Base fileparser"""

    @abstractmethod
    async def parse(
        self,
        db: AsyncSession,
        graph_db: AsyncNeo4jSession,
        tmpdirname: str,
        tmpfilename: str,
        file: schemas.File,
    ) -> list[schemas.File]:
        return []

    async def base_parse(
        self, db: AsyncSession, graph_db: AsyncNeo4jSession, file: schemas.File
    ) -> list[schemas.File]:
        with tempfile.TemporaryDirectory() as tmpdirname:
            path = os.path.join(tmpdirname, file.filename)
            async with aiofiles.open(path, "wb") as f:
                data = await download_file(file.path)
```
Sources: harbinger/src/harbinger/worker/files/parsers.py:26-44
The base_parse method handles the common task of downloading the file to a temporary directory before specific parsing logic is applied.
Sources: harbinger/src/harbinger/worker/files/parsers.py:39-44
A conceptual class diagram for BaseFileParser:
```mermaid
classDiagram
    direction TB
    ABC <|-- BaseFileParser
    BaseFileParser : +parse(db, graph_db, tmpdirname, tmpfilename, file) list~schemas.File~
    BaseFileParser : +base_parse(db, graph_db, file) list~schemas.File~
```
Sources: harbinger/src/harbinger/worker/files/parsers.py:26-44
An example of a file processing utility is process_harbinger_yaml, which parses YAML configurations, including C2 server types and playbook templates.
Sources: harbinger/src/harbinger/worker/files/utils.py:65-75
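As an illustration of the shape such a configuration takes, here is a dependency-free sketch. The section contents and the `summarize_config` helper are assumptions for the example; the real utility loads the YAML text (e.g. with a YAML parser's safe-load) and processes each definition:

```python
# Hypothetical, already-parsed view of a Harbinger YAML file: top-level
# sections such as c2_server_types and playbook_templates each hold a
# list of definitions.
CONFIG = {
    "c2_server_types": [{"name": "mythic"}],
    "playbook_templates": [{"name": "example-playbook"}],
}


def summarize_config(config: dict) -> dict[str, int]:
    """Count how many definitions each top-level section contributes."""
    return {section: len(items) for section, items in config.items()}
```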
Parsing operations are integrated into the system's workflows, primarily through the ParseFile workflow. This workflow is initiated when a file is uploaded or downloaded.
Sources: harbinger/src/harbinger/files/router.py:44-50, harbinger/src/harbinger/indexer/download_files.py:48-52
When a file is uploaded via the API, it is first saved to storage, and then the ParseFile workflow is triggered to process it.
Sources: harbinger/src/harbinger/files/router.py:44-50
```python
# harbinger/src/harbinger/files/router.py
@router.post("/upload_file/", response_model=schemas.File, tags=["files"])
async def create_upload_file(
    file: UploadFile,
    file_type: schemas.FileType | str,
    db: AsyncSession = Depends(get_db),
    user: models.User = Depends(current_active_user),
):
    # ... file saving logic ...
    file_db = await crud.get_file(file_id)  # type: ignore
    if file_db:
        client = await get_client()
        await client.start_workflow(
            ParseFile.run,
            str(file_id),
            id=str(uuid.uuid4()),
            task_queue=constants.FILE_PROCESSING_TASK_QUEUE,
        )
    return file_db
```
Sources: harbinger/src/harbinger/files/router.py:33-50
The ParseFile workflow coordinates the parsing work by invoking the parse_file activity defined in harbinger/src/harbinger/worker/activities.py.
Sources: harbinger/src/harbinger/worker/activities.py:56
```python
# harbinger/src/harbinger/worker/activities.py
@activity.defn
async def parse_file(file_id: str) -> None:
    # ... parsing logic ...
    pass  # Actual implementation would call file parsers
```
Sources: harbinger/src/harbinger/worker/activities.py:56
The general flow for file and output parsing:
```mermaid
graph TD
    A[User Uploads File / C2 Task Output] --> B{Is File or Text?}
    B -- File --> C[Save File to Storage]
    C --> D[Trigger ParseFile Workflow]
    B -- Text Output --> E[Save C2 Output]
    E --> F[Trigger Output Parsers]
    D --> G[ParseFile Activity]
    F --> G
    G --> H{Identify Parser}
    H -- File Type --> I[Call BaseFileParser]
    H -- Text Content --> J[Call OutputParser]
    I --> K[Extract Data]
    J --> K
    K --> L[Store in Database/Graph]
    L --> M[Update UI/Timeline]
```
Sources: harbinger/src/harbinger/files/router.py:33-50, harbinger/src/harbinger/worker/output.py:27-44, harbinger/src/harbinger/worker/files/parsers.py:26-44, harbinger/src/harbinger/worker/activities.py:56
Harbinger utilizes Protobuf messages for structured communication, including the representation of task output and file information.
The TaskOutputRequest message is used to send C2 task output, which can include processes and file lists.
Sources: harbinger/src/harbinger/proto/v1/messages_pb2.py:1982-2266, go/proto/v1/messages.pb.go:50-65
```protobuf
// Protobuf definition; compiled into go/proto/v1/messages.pb.go
message TaskOutputRequest {
  string internal_id = 1;
  string c2_server_id = 2;
  string response_text = 3;
  string output_type = 4;
  string timestamp = 5;
  string internal_task_id = 6;
  string bucket = 7;
  string path = 8;
  repeated Process processes = 9;
  FileList file_list = 11;
}
```
Sources: go/proto/v1/messages.pb.go:50-65
Key fields of TaskOutputRequest:

| Field Name | Type | Description | Source |
|---|---|---|---|
| `internal_id` | `string` | Unique ID for the output. | go/proto/v1/messages.pb.go:51 |
| `c2_server_id` | `string` | ID of the C2 server. | go/proto/v1/messages.pb.go:52 |
| `response_text` | `string` | Raw text response from the C2 task. | go/proto/v1/messages.pb.go:53 |
| `output_type` | `string` | Type of output (e.g., stdout, stderr). | go/proto/v1/messages.pb.go:54 |
| `internal_task_id` | `string` | Internal ID of the task that generated the output. | go/proto/v1/messages.pb.go:56 |
| `processes` | `repeated Process` | List of processes identified in the output. | go/proto/v1/messages.pb.go:61 |
| `file_list` | `FileList` | A list of files and directory information. | go/proto/v1/messages.pb.go:62 |
The OutputType enum specifies the type of output received from a task.
Sources: harbinger/src/harbinger/proto/v1/messages_pb2.py:3074-3163, go/proto/v1/messages.pb.go:121-125
```protobuf
// Protobuf definition; compiled into go/proto/v1/messages.pb.go
enum OutputType {
  OUTPUT_TYPE_UNSPECIFIED = 0;
  OUTPUT_TYPE_STDOUT = 1;
  OUTPUT_TYPE_STDERR = 2;
}
```
Sources: go/proto/v1/messages.pb.go:121-125
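On the Python side these values are also available from the generated messages_pb2 module; the hand-written mirror below, with a hypothetical `label_for` helper, illustrates how a consumer might map wire values to display labels:

```python
from enum import IntEnum


class OutputType(IntEnum):
    """Illustrative Python mirror of the Protobuf OutputType enum above."""

    OUTPUT_TYPE_UNSPECIFIED = 0
    OUTPUT_TYPE_STDOUT = 1
    OUTPUT_TYPE_STDERR = 2


def label_for(value: int) -> str:
    """Map a raw wire value to a short label, e.g. for UI display."""
    try:
        return OutputType(value).name.removeprefix("OUTPUT_TYPE_").lower()
    except ValueError:
        # Unknown values can appear if the peer speaks a newer schema.
        return "unknown"
```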
The Process and ShareFile messages define the structure for process information and shared file details, respectively, which can be part of a TaskOutputRequest.
Sources: harbinger/src/harbinger/proto/v1/messages_pb2.py:1453-1773, go/proto/v1/messages.pb.go:27-40
```mermaid
classDiagram
    direction TB
    TaskOutputRequest --> Process : contains
    TaskOutputRequest --> FileList : contains
    FileList --> ShareFile : contains
    Process : +process_id: int32
    Process : +architecture: string
    Process : +name: string
    Process : +user: string
    Process : +bin_path: string
    Process : +parent_process_id: int32
    ShareFile : +type: string
    ShareFile : +size: int64
    ShareFile : +last_accessed: string
    ShareFile : +last_modified: string
    ShareFile : +created: string
    ShareFile : +unc_path: string
    ShareFile : +name: string
```
Sources: harbinger/src/harbinger/proto/v1/messages_pb2.py:1453-1773, go/proto/v1/messages.pb.go:27-40
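To illustrate the fields, here are dependency-free dataclass mirrors plus a hypothetical `build_process_tree` helper (the real code works with the generated Protobuf classes, not these dataclasses):

```python
from dataclasses import dataclass


@dataclass
class Process:
    """Illustrative mirror of the Process message fields shown above."""

    process_id: int = 0
    architecture: str = ""
    name: str = ""
    user: str = ""
    bin_path: str = ""
    parent_process_id: int = 0


@dataclass
class ShareFile:
    """Illustrative mirror of a few ShareFile message fields."""

    type: str = ""
    size: int = 0
    unc_path: str = ""
    name: str = ""


def build_process_tree(processes: list[Process]) -> dict[int, list[int]]:
    """Group child PIDs under their parent PID, as a timeline view might."""
    tree: dict[int, list[int]] = {}
    for p in processes:
        tree.setdefault(p.parent_process_id, []).append(p.process_id)
    return tree
```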
Parsed data is persisted in the database using specific schemas.
The ParseResult schema captures the outcome of a parsing operation, linking it to the source file or task output.
Sources: harbinger/src/harbinger/database/schemas.py:270-279, harbinger/interface/src/models.ts:153-162
| Field Name | Type | Description | Source |
|---|---|---|---|
| `file_id` | `string` | ID of the file that was parsed. | harbinger/interface/src/models.ts:154 |
| `c2_task_id` | `string` | ID of the C2 task. | harbinger/interface/src/models.ts:155 |
| `c2_task_output_id` | `string` | ID of the C2 task output. | harbinger/interface/src/models.ts:156 |
| `proxy_job_output_id` | `string` | ID of the proxy job output. | harbinger/interface/src/models.ts:157 |
| `proxy_job_id` | `string` | ID of the proxy job. | harbinger/interface/src/models.ts:158 |
| `parser` | `string` | Name of the parser used. | harbinger/interface/src/models.ts:159 |
| `log` | `string` | Log or summary of the parsing process. | harbinger/interface/src/models.ts:160 |
| `id` | `string` | Unique ID for the parse result. | harbinger/interface/src/models.ts:161 |
| `time_created` | `string` | Timestamp of creation. | harbinger/interface/src/models.ts:162 |
| `labels` | `Array<string>` | Labels associated with the result. | harbinger/interface/src/models.ts:163 |
The Highlight schema stores specific findings or interesting strings identified during parsing.
Sources: harbinger/src/harbinger/database/schemas.py:260-269, harbinger/interface/src/models.ts:139-152
| Field Name | Type | Description | Source |
|---|---|---|---|
| `file_id` | `string` | ID of the file where the highlight was found. | harbinger/interface/src/models.ts:140 |
| `c2_task_id` | `string` | ID of the C2 task. | harbinger/interface/src/models.ts:141 |
| `c2_task_output_id` | `string` | ID of the C2 task output. | harbinger/interface/src/models.ts:142 |
| `proxy_job_output_id` | `string` | ID of the proxy job output. | harbinger/interface/src/models.ts:143 |
| `proxy_job_id` | `string` | ID of the proxy job. | harbinger/interface/src/models.ts:144 |
| `parse_result_id` | `string` | ID of the parse result this highlight belongs to. | harbinger/interface/src/models.ts:145 |
| `rule_id` | `number` | ID of the rule that triggered the highlight. | harbinger/interface/src/models.ts:146 |
| `rule_type` | `string` | Type of the rule. | harbinger/interface/src/models.ts:147 |
| `hit` | `string` | The actual string that was highlighted. | harbinger/interface/src/models.ts:148 |
| `start` | `string` | Start position of the highlight. | harbinger/interface/src/models.ts:149 |
| `end` | `string` | End position of the highlight. | harbinger/interface/src/models.ts:150 |
| `line` | `number` | Line number where the highlight was found. | harbinger/interface/src/models.ts:151 |
| `id` | `string` | Unique ID for the highlight. | harbinger/interface/src/models.ts:152 |
| `time_created` | `string` | Timestamp of creation. | harbinger/interface/src/models.ts:152 |
| `labels` | `Array<string>` | Labels associated with the highlight. | harbinger/interface/src/models.ts:152 |
The ParseResult and Highlight schemas are designed to link back to the source data (files, C2 tasks, proxy jobs) and to each other, forming a comprehensive record of extracted intelligence.
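A minimal sketch of how a rule-based scan could produce Highlight-shaped records follows. The `run_rule` function is hypothetical, and integer `start`/`end` offsets are used for simplicity even though the interface schema types those fields as strings:

```python
import re
from dataclasses import dataclass


@dataclass
class Highlight:
    """Subset of the Highlight schema fields relevant to rule matching."""

    rule_type: str
    hit: str
    start: int
    end: int
    line: int


def run_rule(rule_type: str, pattern: str, text: str) -> list[Highlight]:
    """Hypothetical rule runner: record every regex hit with its offsets."""
    hits = []
    for m in re.finditer(pattern, text):
        # Line numbers are 1-based: count newlines before the match start.
        line = text.count("\n", 0, m.start()) + 1
        hits.append(Highlight(rule_type, m.group(), m.start(), m.end(), line))
    return hits
```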
```mermaid
erDiagram
    File ||--o{ ParseResult : "has"
    C2_Task ||--o{ ParseResult : "has"
    C2_Task_Output ||--o{ ParseResult : "has"
    Proxy_Job ||--o{ ParseResult : "has"
    Proxy_Job_Output ||--o{ ParseResult : "has"
    ParseResult ||--o{ Highlight : "generates"
    File {
        string id
        string filename
        string path
    }
    C2_Task {
        string id
    }
    C2_Task_Output {
        string id
    }
    Proxy_Job {
        string id
    }
    Proxy_Job_Output {
        string id
    }
    ParseResult {
        string id
        string file_id
        string c2_task_id
        string c2_task_output_id
        string proxy_job_output_id
        string proxy_job_id
        string parser
        string log
        string time_created
        string[] labels
    }
    Highlight {
        string id
        string file_id
        string c2_task_id
        string c2_task_output_id
        string proxy_job_output_id
        string proxy_job_id
        string parse_result_id
        int rule_id
        string rule_type
        string hit
        string start
        string end
        int line
        string time_created
        string[] labels
    }
```
Sources: harbinger/src/harbinger/database/schemas.py:270-279, harbinger/src/harbinger/database/schemas.py:260-269, harbinger/interface/src/models.ts:153-162, harbinger/interface/src/models.ts:139-152
File and output parsing in Harbinger provides a structured approach to extracting actionable intelligence from various data sources. By defining clear interfaces for parsers and integrating them into robust workflows, the system ensures that raw C2 outputs and uploaded files are transformed into meaningful data points, enriching the overall operational picture and supporting further analysis and automation.