File and Output Parsing
Related topics: Data Model and Flow, Extending Harbinger Playbook and Output Templating
Relevant source files
- harbinger/src/harbinger/worker/activities.py
- harbinger/src/harbinger/worker/files/parsers.py
- harbinger/src/harbinger/worker/output.py
- harbinger/interface/src/models.ts
- harbinger/src/harbinger/proto/v1/messages_pb2.py
- go/proto/v1/messages.pb.go
- harbinger/src/harbinger/database/schemas.py
- harbinger/src/harbinger/files/router.py
- harbinger/src/harbinger/worker/files/utils.py
File and output parsing in Harbinger is a critical component responsible for processing raw data obtained from C2 implants, proxy jobs, and file uploads. This system extracts structured information such as processes, file lists, and specific indicators, which are then used to enrich the project's database and graph models. The parsing capabilities allow for automated analysis, timeline generation, and identification of key findings, contributing to situational awareness and detection risk assessments.
The parsing framework is designed for extensibility: abstract base classes for both output and file parsing allow new parsers to be added for additional data formats and command outputs. The entire process is integrated into the system's workflow engine, ensuring that data is processed efficiently and consistently.
Output parsing focuses on analyzing text-based responses from C2 tasks to extract relevant entities. This is managed through the OutputParser abstract base class.
Sources: harbinger/src/harbinger/worker/output.py:27-44
The OutputParser class defines the interface for all text output parsers. Each concrete parser must implement match and parse methods.
Sources: harbinger/src/harbinger/worker/output.py:27-44
```python
# harbinger/src/harbinger/worker/output.py
class OutputParser(abc.ABC):
    needle: list[str] = []
    labels: list[str] = []

    def __init__(self, db: AsyncSession) -> None:
        self.db = db

    @abc.abstractmethod
    async def match(self, text: str) -> bool:
        raise NotImplementedError("This should be implemented")

    @abc.abstractmethod
    async def parse(
        self,
        text: str,
        c2_implant_id: str | UUID4 | None = None,
        c2_output_id: str | UUID4 | None = None,
        file_id: str | UUID4 | None = None,
    ) -> None:
        raise NotImplementedError("This should be implemented")
```
Sources: harbinger/src/harbinger/worker/output.py:27-44
The match method determines if a given text output is relevant to the parser, while the parse method performs the actual extraction and storage of information. The needle attribute can be used to quickly identify relevant text.
Sources: harbinger/src/harbinger/worker/output.py:31-35
A conceptual class diagram for OutputParser:
```mermaid
classDiagram
    direction TB
    ABC <|-- OutputParser
    OutputParser : +needle: list~str~
    OutputParser : +labels: list~str~
    OutputParser : +__init__(db: AsyncSession)
    OutputParser : +match(text: str) bool
    OutputParser : +parse(text: str, c2_implant_id, c2_output_id, file_id) None
```
Sources: harbinger/src/harbinger/worker/output.py:27-44
An example of a specific output parser is parse_ccache, which handles Kerberos credential cache files.
Sources: harbinger/src/harbinger/worker/output.py:21, harbinger/src/harbinger/worker/activities.py:59
File parsing handles the analysis of uploaded or downloaded files to extract metadata, content, and structured information. This is facilitated by the BaseFileParser abstract base class.
Sources: harbinger/src/harbinger/worker/files/parsers.py:26-30
The BaseFileParser defines the contract for parsers that process entire files. It includes an abstract parse method and a base_parse utility method.
Sources: harbinger/src/harbinger/worker/files/parsers.py:26-44
```python
# harbinger/src/harbinger/worker/files/parsers.py
class BaseFileParser(ABC):
    """Base fileparser"""

    @abstractmethod
    async def parse(
        self,
        db: AsyncSession,
        graph_db: AsyncNeo4jSession,
        tmpdirname: str,
        tmpfilename: str,
        file: schemas.File,
    ) -> list[schemas.File]:
        return []

    async def base_parse(
        self, db: AsyncSession, graph_db: AsyncNeo4jSession, file: schemas.File
    ) -> list[schemas.File]:
        with tempfile.TemporaryDirectory() as tmpdirname:
            path = os.path.join(tmpdirname, file.filename)
            async with aiofiles.open(path, "wb") as f:
                data = await download_file(file.path)
```
Sources: harbinger/src/harbinger/worker/files/parsers.py:26-44
The base_parse method handles the common task of downloading the file to a temporary directory before specific parsing logic is applied.
Sources: harbinger/src/harbinger/worker/files/parsers.py:39-44
A conceptual class diagram for BaseFileParser:
```mermaid
classDiagram
    direction TB
    ABC <|-- BaseFileParser
    BaseFileParser : +parse(db, graph_db, tmpdirname, tmpfilename, file) list~schemas.File~
    BaseFileParser : +base_parse(db, graph_db, file) list~schemas.File~
```
Sources: harbinger/src/harbinger/worker/files/parsers.py:26-44
An example of a file processing utility is process_harbinger_yaml, which parses YAML configurations, including C2 server types and playbook templates.
Sources: harbinger/src/harbinger/worker/files/utils.py:65-75
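As an illustration of the shape such a configuration takes, here is a dependency-free sketch. The section contents and the `summarize_config` helper are assumptions for the example; the real utility loads the YAML text (e.g. with a YAML parser's safe-load) and processes each definition:

```python
# Hypothetical, already-parsed view of a Harbinger YAML file: top-level
# sections such as c2_server_types and playbook_templates each hold a
# list of definitions.
CONFIG = {
    "c2_server_types": [{"name": "mythic"}],
    "playbook_templates": [{"name": "example-playbook"}],
}


def summarize_config(config: dict) -> dict[str, int]:
    """Count how many definitions each top-level section contributes."""
    return {section: len(items) for section, items in config.items()}
```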
Parsing operations are integrated into the system's workflows, primarily through the ParseFile workflow. This workflow is initiated when a file is uploaded or downloaded.
Sources: harbinger/src/harbinger/files/router.py:44-50, harbinger/src/harbinger/indexer/download_files.py:48-52
When a file is uploaded via the API, it is first saved to storage, and then the ParseFile workflow is triggered to process it.
Sources: harbinger/src/harbinger/files/router.py:44-50
```python
# harbinger/src/harbinger/files/router.py
@router.post("/upload_file/", response_model=schemas.File, tags=["files"])
async def create_upload_file(
    file: UploadFile,
    file_type: schemas.FileType | str,
    db: AsyncSession = Depends(get_db),
    user: models.User = Depends(current_active_user),
):
    # ... file saving logic ...
    file_db = await crud.get_file(file_id)  # type: ignore
    if file_db:
        client = await get_client()
        await client.start_workflow(
            ParseFile.run,
            str(file_id),
            id=str(uuid.uuid4()),
            task_queue=constants.FILE_PROCESSING_TASK_QUEUE,
        )
    return file_db
```
Sources: harbinger/src/harbinger/files/router.py:33-50
The ParseFile workflow coordinates the parsing work by invoking the parse_file activity defined in harbinger/src/harbinger/worker/activities.py.
Sources: harbinger/src/harbinger/worker/activities.py:56
```python
# harbinger/src/harbinger/worker/activities.py
@activity.defn
async def parse_file(file_id: str) -> None:
    # ... parsing logic ...
    pass  # Actual implementation would call file parsers
```
Sources: harbinger/src/harbinger/worker/activities.py:56
The general flow for file and output parsing:
```mermaid
graph TD
    A[User Uploads File / C2 Task Output] --> B{Is File or Text?}
    B -- File --> C[Save File to Storage]
    C --> D[Trigger ParseFile Workflow]
    B -- Text Output --> E[Save C2 Output]
    E --> F[Trigger Output Parsers]
    D --> G[ParseFile Activity]
    F --> G
    G --> H{Identify Parser}
    H -- File Type --> I[Call BaseFileParser]
    H -- Text Content --> J[Call OutputParser]
    I --> K[Extract Data]
    J --> K
    K --> L[Store in Database/Graph]
    L --> M[Update UI/Timeline]
```
Sources: harbinger/src/harbinger/files/router.py:33-50, harbinger/src/harbinger/worker/output.py:27-44, harbinger/src/harbinger/worker/files/parsers.py:26-44, harbinger/src/harbinger/worker/activities.py:56
Harbinger utilizes Protobuf messages for structured communication, including the representation of task output and file information.
The TaskOutputRequest message is used to send C2 task output, which can include processes and file lists.
Sources: harbinger/src/harbinger/proto/v1/messages_pb2.py:1982-2266, go/proto/v1/messages.pb.go:50-65
```protobuf
// Protobuf definition; compiled into go/proto/v1/messages.pb.go
message TaskOutputRequest {
  string internal_id = 1;
  string c2_server_id = 2;
  string response_text = 3;
  string output_type = 4;
  string timestamp = 5;
  string internal_task_id = 6;
  string bucket = 7;
  string path = 8;
  repeated Process processes = 9;
  FileList file_list = 11;
}
```
Sources: go/proto/v1/messages.pb.go:50-65
Key fields of TaskOutputRequest:

| Field Name | Type | Description | Source |
|---|---|---|---|
| `internal_id` | `string` | Unique ID for the output. | go/proto/v1/messages.pb.go:51 |
| `c2_server_id` | `string` | ID of the C2 server. | go/proto/v1/messages.pb.go:52 |
| `response_text` | `string` | Raw text response from the C2 task. | go/proto/v1/messages.pb.go:53 |
| `output_type` | `string` | Type of output (e.g., stdout, stderr). | go/proto/v1/messages.pb.go:54 |
| `internal_task_id` | `string` | Internal ID of the task that generated the output. | go/proto/v1/messages.pb.go:56 |
| `processes` | `repeated Process` | List of processes identified in the output. | go/proto/v1/messages.pb.go:61 |
| `file_list` | `FileList` | A list of files and directory information. | go/proto/v1/messages.pb.go:62 |
The OutputType enum specifies the type of output received from a task.
Sources: harbinger/src/harbinger/proto/v1/messages_pb2.py:3074-3163, go/proto/v1/messages.pb.go:121-125
```protobuf
// Protobuf definition; compiled into go/proto/v1/messages.pb.go
enum OutputType {
  OUTPUT_TYPE_UNSPECIFIED = 0;
  OUTPUT_TYPE_STDOUT = 1;
  OUTPUT_TYPE_STDERR = 2;
}
```
Sources: go/proto/v1/messages.pb.go:121-125
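On the Python side these values are also available from the generated messages_pb2 module; the hand-written mirror below, with a hypothetical `label_for` helper, illustrates how a consumer might map wire values to display labels:

```python
from enum import IntEnum


class OutputType(IntEnum):
    """Illustrative Python mirror of the Protobuf OutputType enum above."""

    OUTPUT_TYPE_UNSPECIFIED = 0
    OUTPUT_TYPE_STDOUT = 1
    OUTPUT_TYPE_STDERR = 2


def label_for(value: int) -> str:
    """Map a raw wire value to a short label, e.g. for UI display."""
    try:
        return OutputType(value).name.removeprefix("OUTPUT_TYPE_").lower()
    except ValueError:
        # Unknown values can appear if the peer speaks a newer schema.
        return "unknown"
```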
The Process and ShareFile messages define the structure for process information and shared file details, respectively, which can be part of a TaskOutputRequest.
Sources: harbinger/src/harbinger/proto/v1/messages_pb2.py:1453-1773, go/proto/v1/messages.pb.go:27-40
```mermaid
classDiagram
    direction TB
    TaskOutputRequest --> Process : contains
    TaskOutputRequest --> FileList : contains
    FileList --> ShareFile : contains
    Process : +process_id: int32
    Process : +architecture: string
    Process : +name: string
    Process : +user: string
    Process : +bin_path: string
    Process : +parent_process_id: int32
    ShareFile : +type: string
    ShareFile : +size: int64
    ShareFile : +last_accessed: string
    ShareFile : +last_modified: string
    ShareFile : +created: string
    ShareFile : +unc_path: string
    ShareFile : +name: string
```
Sources: harbinger/src/harbinger/proto/v1/messages_pb2.py:1453-1773, go/proto/v1/messages.pb.go:27-40
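To illustrate the fields, here are dependency-free dataclass mirrors plus a hypothetical `build_process_tree` helper (the real code works with the generated Protobuf classes, not these dataclasses):

```python
from dataclasses import dataclass


@dataclass
class Process:
    """Illustrative mirror of the Process message fields shown above."""

    process_id: int = 0
    architecture: str = ""
    name: str = ""
    user: str = ""
    bin_path: str = ""
    parent_process_id: int = 0


@dataclass
class ShareFile:
    """Illustrative mirror of a few ShareFile message fields."""

    type: str = ""
    size: int = 0
    unc_path: str = ""
    name: str = ""


def build_process_tree(processes: list[Process]) -> dict[int, list[int]]:
    """Group child PIDs under their parent PID, as a timeline view might."""
    tree: dict[int, list[int]] = {}
    for p in processes:
        tree.setdefault(p.parent_process_id, []).append(p.process_id)
    return tree
```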
Parsed data is persisted in the database using specific schemas.
The ParseResult schema captures the outcome of a parsing operation, linking it to the source file or task output.
Sources: harbinger/src/harbinger/database/schemas.py:270-279, harbinger/interface/src/models.ts:153-162
| Field Name | Type | Description | Source |
|---|---|---|---|
| `file_id` | `string` | ID of the file that was parsed. | harbinger/interface/src/models.ts:154 |
| `c2_task_id` | `string` | ID of the C2 task. | harbinger/interface/src/models.ts:155 |
| `c2_task_output_id` | `string` | ID of the C2 task output. | harbinger/interface/src/models.ts:156 |
| `proxy_job_output_id` | `string` | ID of the proxy job output. | harbinger/interface/src/models.ts:157 |
| `proxy_job_id` | `string` | ID of the proxy job. | harbinger/interface/src/models.ts:158 |
| `parser` | `string` | Name of the parser used. | harbinger/interface/src/models.ts:159 |
| `log` | `string` | Log or summary of the parsing process. | harbinger/interface/src/models.ts:160 |
| `id` | `string` | Unique ID for the parse result. | harbinger/interface/src/models.ts:161 |
| `time_created` | `string` | Timestamp of creation. | harbinger/interface/src/models.ts:162 |
| `labels` | `Array<string>` | Labels associated with the result. | harbinger/interface/src/models.ts:163 |
The Highlight schema stores specific findings or interesting strings identified during parsing.
Sources: harbinger/src/harbinger/database/schemas.py:260-269, harbinger/interface/src/models.ts:139-152
| Field Name | Type | Description | Source |
|---|---|---|---|
| `file_id` | `string` | ID of the file where the highlight was found. | harbinger/interface/src/models.ts:140 |
| `c2_task_id` | `string` | ID of the C2 task. | harbinger/interface/src/models.ts:141 |
| `c2_task_output_id` | `string` | ID of the C2 task output. | harbinger/interface/src/models.ts:142 |
| `proxy_job_output_id` | `string` | ID of the proxy job output. | harbinger/interface/src/models.ts:143 |
| `proxy_job_id` | `string` | ID of the proxy job. | harbinger/interface/src/models.ts:144 |
| `parse_result_id` | `string` | ID of the parse result this highlight belongs to. | harbinger/interface/src/models.ts:145 |
| `rule_id` | `number` | ID of the rule that triggered the highlight. | harbinger/interface/src/models.ts:146 |
| `rule_type` | `string` | Type of the rule. | harbinger/interface/src/models.ts:147 |
| `hit` | `string` | The actual string that was highlighted. | harbinger/interface/src/models.ts:148 |
| `start` | `string` | Start position of the highlight. | harbinger/interface/src/models.ts:149 |
| `end` | `string` | End position of the highlight. | harbinger/interface/src/models.ts:150 |
| `line` | `number` | Line number where the highlight was found. | harbinger/interface/src/models.ts:151 |
| `id` | `string` | Unique ID for the highlight. | harbinger/interface/src/models.ts:152 |
| `time_created` | `string` | Timestamp of creation. | harbinger/interface/src/models.ts:152 |
| `labels` | `Array<string>` | Labels associated with the highlight. | harbinger/interface/src/models.ts:152 |
The ParseResult and Highlight schemas are designed to link back to the source data (files, C2 tasks, proxy jobs) and to each other, forming a comprehensive record of extracted intelligence.
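A minimal sketch of how a rule-based scan could produce Highlight-shaped records follows. The `run_rule` function is hypothetical, and integer `start`/`end` offsets are used for simplicity even though the interface schema types those fields as strings:

```python
import re
from dataclasses import dataclass


@dataclass
class Highlight:
    """Subset of the Highlight schema fields relevant to rule matching."""

    rule_type: str
    hit: str
    start: int
    end: int
    line: int


def run_rule(rule_type: str, pattern: str, text: str) -> list[Highlight]:
    """Hypothetical rule runner: record every regex hit with its offsets."""
    hits = []
    for m in re.finditer(pattern, text):
        # Line numbers are 1-based: count newlines before the match start.
        line = text.count("\n", 0, m.start()) + 1
        hits.append(Highlight(rule_type, m.group(), m.start(), m.end(), line))
    return hits
```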
```mermaid
erDiagram
    File ||--o{ ParseResult : "has"
    C2_Task ||--o{ ParseResult : "has"
    C2_Task_Output ||--o{ ParseResult : "has"
    Proxy_Job ||--o{ ParseResult : "has"
    Proxy_Job_Output ||--o{ ParseResult : "has"
    ParseResult ||--o{ Highlight : "generates"
    File {
        string id
        string filename
        string path
    }
    C2_Task {
        string id
    }
    C2_Task_Output {
        string id
    }
    Proxy_Job {
        string id
    }
    Proxy_Job_Output {
        string id
    }
    ParseResult {
        string id
        string file_id
        string c2_task_id
        string c2_task_output_id
        string proxy_job_output_id
        string proxy_job_id
        string parser
        string log
        string time_created
        string[] labels
    }
    Highlight {
        string id
        string file_id
        string c2_task_id
        string c2_task_output_id
        string proxy_job_output_id
        string proxy_job_id
        string parse_result_id
        int rule_id
        string rule_type
        string hit
        string start
        string end
        int line
        string time_created
        string[] labels
    }
```
Sources: harbinger/src/harbinger/database/schemas.py:270-279, harbinger/src/harbinger/database/schemas.py:260-269, harbinger/interface/src/models.ts:153-162, harbinger/interface/src/models.ts:139-152
File and output parsing in Harbinger provides a structured approach to extracting actionable intelligence from various data sources. By defining clear interfaces for parsers and integrating them into robust workflows, the system ensures that raw C2 outputs and uploaded files are transformed into meaningful data points, enriching the overall operational picture and supporting further analysis and automation.