
File and Output Parsing

Matthijs Gielen edited this page Jul 15, 2025 · 3 revisions

Related Pages

Related topics: Data Model and Flow, Extending Harbinger Playbook and Output Templating

File and output parsing in Harbinger processes raw data obtained from C2 implants, proxy jobs, and file uploads. It extracts structured information such as processes, file lists, and specific indicators, which is then used to enrich the project's database and graph models. The results support automated analysis, timeline generation, and identification of key findings, and they feed situational awareness and detection risk assessments.

The parsing framework is built for extensibility: output parsing and file parsing each have an abstract base class, so parsers for new data formats and command outputs can be added by subclassing. Parsing runs inside the system's workflow engine, so data is processed consistently.

Output Parsing

Output parsing focuses on analyzing text-based responses from C2 tasks to extract relevant entities. This is managed through the OutputParser abstract base class. Sources: harbinger/src/harbinger/worker/output.py:27-44

OutputParser Abstract Base Class

The OutputParser class defines the interface for all text output parsers. Each concrete parser must implement match and parse methods. Sources: harbinger/src/harbinger/worker/output.py:27-44

# harbinger/src/harbinger/worker/output.py
class OutputParser(abc.ABC):
    needle: list[str] = []
    labels: list[str] = []

    def __init__(self, db: AsyncSession) -> None:
        self.db = db

    @abc.abstractmethod
    async def match(self, text: str) -> bool:
        raise NotImplementedError("This should be implemented")

    @abc.abstractmethod
    async def parse(
        self,
        text: str,
        c2_implant_id: str | UUID4 | None = None,
        c2_output_id: str | UUID4 | None = None,
        file_id: str | UUID4 | None = None,
    ) -> None:
        raise NotImplementedError("This should be implemented")

Sources: harbinger/src/harbinger/worker/output.py:27-44

The match method determines if a given text output is relevant to the parser, while the parse method performs the actual extraction and storage of information. The needle attribute can be used to quickly identify relevant text. Sources: harbinger/src/harbinger/worker/output.py:31-35

A conceptual class diagram for OutputParser:

classDiagram
    direction TD
    abc.ABC <|-- OutputParser
    OutputParser : +needle: list~str~
    OutputParser : +labels: list~str~
    OutputParser : +__init__(db: AsyncSession)
    OutputParser : +match(text: str): bool
    OutputParser : +parse(text: str, c2_implant_id: str | UUID4 | None, c2_output_id: str | UUID4 | None, file_id: str | UUID4 | None): None

Sources: harbinger/src/harbinger/worker/output.py:27-44

An example of a specific output parser is parse_ccache, which handles Kerberos credential cache files. Sources: harbinger/src/harbinger/worker/output.py:21, harbinger/src/harbinger/worker/activities.py:59

File Parsing

File parsing handles the analysis of uploaded or downloaded files to extract metadata, content, and structured information. This is facilitated by the BaseFileParser abstract base class. Sources: harbinger/src/harbinger/worker/files/parsers.py:26-30

BaseFileParser Abstract Base Class

The BaseFileParser defines the contract for parsers that process entire files. It includes an abstract parse method and a base_parse utility method. Sources: harbinger/src/harbinger/worker/files/parsers.py:26-44

# harbinger/src/harbinger/worker/files/parsers.py
class BaseFileParser(ABC):
    """Base fileparser"""

    @abstractmethod
    async def parse(
        self,
        db: AsyncSession,
        graph_db: AsyncNeo4jSession,
        tmpdirname: str,
        tmpfilename: str,
        file: schemas.File,
    ) -> list[schemas.File]:
        return []

    async def base_parse(
        self, db: AsyncSession, graph_db: AsyncNeo4jSession, file: schemas.File
    ) -> list[schemas.File]:
        with tempfile.TemporaryDirectory() as tmpdirname:
            path = os.path.join(tmpdirname, file.filename)
            async with aiofiles.open(path, "wb") as f:
                data = await download_file(file.path)
                # ... excerpt ends here; the downloaded bytes are written to the temporary file before parsing

Sources: harbinger/src/harbinger/worker/files/parsers.py:26-44

The base_parse method handles the common task of downloading the file to a temporary directory before specific parsing logic is applied. Sources: harbinger/src/harbinger/worker/files/parsers.py:39-44
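The staging pattern described here can be sketched with the standard library alone. In the example below, FileRecord, the stubbed download_file, FileParserSketch, and LineCountParser are all hypothetical stand-ins, and the final call from base_parse into parse is an assumption, since the excerpt above does not show that step.

```python
import abc
import asyncio
import os
import tempfile
from dataclasses import dataclass


@dataclass
class FileRecord:
    """Stand-in for schemas.File with only the fields used here."""

    filename: str
    path: str


async def download_file(path: str) -> bytes:
    """Hypothetical stub; the real helper fetches bytes from storage."""
    return b"first line\nsecond line\n"


class FileParserSketch(abc.ABC):
    @abc.abstractmethod
    async def parse(self, tmpdirname: str, tmpfilename: str, file: FileRecord) -> list[str]:
        ...

    async def base_parse(self, file: FileRecord) -> list[str]:
        # Stage the file in a temporary directory, then delegate to parse().
        with tempfile.TemporaryDirectory() as tmpdirname:
            path = os.path.join(tmpdirname, file.filename)
            data = await download_file(file.path)
            with open(path, "wb") as f:  # plain open() keeps the sketch stdlib-only
                f.write(data)
            return await self.parse(tmpdirname, path, file)


class LineCountParser(FileParserSketch):
    """Hypothetical parser that reports how many lines the file has."""

    async def parse(self, tmpdirname: str, tmpfilename: str, file: FileRecord) -> list[str]:
        with open(tmpfilename, "rb") as f:
            count = sum(1 for _ in f)
        return [f"{file.filename}: {count} lines"]
```

Because parse runs inside the with block, the temporary directory still exists while the concrete parser reads the file and is removed afterwards.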

A conceptual class diagram for BaseFileParser:

classDiagram
    direction TD
    ABC <|-- BaseFileParser
    BaseFileParser : +parse(db, graph_db, tmpdirname, tmpfilename, file): list~schemas.File~
    BaseFileParser : +base_parse(db, graph_db, file): list~schemas.File~

Sources: harbinger/src/harbinger/worker/files/parsers.py:26-44

An example of a file processing utility is process_harbinger_yaml, which parses YAML configurations, including C2 server types and playbook templates. Sources: harbinger/src/harbinger/worker/files/utils.py:65-75

Workflow Integration: ParseFile

Parsing operations are integrated into the system's workflows, primarily through the ParseFile workflow. This workflow is initiated when a file is uploaded or downloaded. Sources: harbinger/src/harbinger/files/router.py:44-50, harbinger/src/harbinger/indexer/download_files.py:48-52

File Upload and Parsing Flow

When a file is uploaded via the API, it is first saved to storage, and then the ParseFile workflow is triggered to process it. Sources: harbinger/src/harbinger/files/router.py:44-50

# harbinger/src/harbinger/files/router.py
@router.post("/upload_file/", response_model=schemas.File, tags=["files"])
async def create_upload_file(
    file: UploadFile,
    file_type: schemas.FileType | str,
    db: AsyncSession = Depends(get_db),
    user: models.User = Depends(current_active_user),
):
    # ... file saving logic ...
    file_db = await crud.get_file(file_id)  # type: ignore
    if file_db:
        client = await get_client()
        await client.start_workflow(
            ParseFile.run,
            str(file_id),
            id=str(uuid.uuid4()),
            task_queue=constants.FILE_PROCESSING_TASK_QUEUE,
        )
    return file_db

Sources: harbinger/src/harbinger/files/router.py:33-50

The ParseFile workflow coordinates the parsing activities. The excerpt below shows the parse_file activity from harbinger/src/harbinger/worker/activities.py, reduced to a placeholder. Sources: harbinger/src/harbinger/worker/activities.py:56

# harbinger/src/harbinger/worker/activities.py
@activity.defn
async def parse_file(file_id: str) -> None:
    # ... parsing logic ...
    pass # Actual implementation would call file parsers

Sources: harbinger/src/harbinger/worker/activities.py:56

The general flow for file and output parsing:

graph TD
    A[User Uploads File / C2 Task Output] --> B{Is File or Text?};
    B -- File --> C[Save File to Storage];
    C --> D[Trigger ParseFile Workflow];
    B -- Text Output --> E[Save C2 Output];
    E --> F[Trigger Output Parsers];
    D --> G[ParseFile Activity];
    F --> G;
    G --> H{Identify Parser};
    H -- File Type --> I[Call BaseFileParser];
    H -- Text Content --> J[Call OutputParser];
    I --> K[Extract Data];
    J --> K;
    K --> L[Store in Database/Graph];
    L --> M[Update UI/Timeline];

Sources: harbinger/src/harbinger/files/router.py:33-50, harbinger/src/harbinger/worker/output.py:27-44, harbinger/src/harbinger/worker/files/parsers.py:26-44, harbinger/src/harbinger/worker/activities.py:56
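The "Identify Parser" step in the flow above can be sketched as a loop over candidate parsers that runs each one whose match accepts the text. KeywordParser and dispatch are hypothetical names, and running every matching parser (rather than stopping at the first) is an assumption.

```python
import asyncio
from typing import Protocol


class TextParser(Protocol):
    """Structural type for anything with async match() and parse()."""

    async def match(self, text: str) -> bool: ...

    async def parse(self, text: str) -> None: ...


class KeywordParser:
    """Hypothetical parser that fires when its keyword occurs in the text."""

    def __init__(self, keyword: str) -> None:
        self.keyword = keyword
        self.parsed: list[str] = []  # stands in for database writes

    async def match(self, text: str) -> bool:
        return self.keyword in text

    async def parse(self, text: str) -> None:
        self.parsed.append(text)


async def dispatch(text: str, parsers: list[TextParser]) -> list[TextParser]:
    """Run every parser whose match() accepts the text; return those that ran."""
    ran = []
    for parser in parsers:
        if await parser.match(text):
            await parser.parse(text)
            ran.append(parser)
    return ran
```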

Data Structures for Parsed Output

Harbinger utilizes Protobuf messages for structured communication, including the representation of task output and file information.

TaskOutputRequest

The TaskOutputRequest message is used to send C2 task output, which can include processes and file lists. Sources: harbinger/src/harbinger/proto/v1/messages_pb2.py:1982-2266, go/proto/v1/messages.pb.go:50-65

// go/proto/v1/messages.pb.go
message TaskOutputRequest {
	string internal_id = 1;
	string c2_server_id = 2;
	string response_text = 3;
	string output_type = 4;
	string timestamp = 5;
	string internal_task_id = 6;
	string bucket = 7;
	string path = 8;
	repeated Process processes = 9;
	FileList file_list = 11;
}

Sources: go/proto/v1/messages.pb.go:50-65

Key fields of TaskOutputRequest:

| Field Name | Type | Description | Source |
| --- | --- | --- | --- |
| internal_id | string | Unique ID for the output. | go/proto/v1/messages.pb.go:51 |
| c2_server_id | string | ID of the C2 server. | go/proto/v1/messages.pb.go:52 |
| response_text | string | Raw text response from the C2 task. | go/proto/v1/messages.pb.go:53 |
| output_type | string | Type of output (e.g., stdout, stderr). | go/proto/v1/messages.pb.go:54 |
| processes | repeated Process | List of processes identified in the output. | go/proto/v1/messages.pb.go:61 |
| file_list | FileList | A list of files and directory information. | go/proto/v1/messages.pb.go:62 |
| internal_task_id | string | Internal ID of the task that generated the output. | go/proto/v1/messages.pb.go:56 |

OutputType Enum

The OutputType enum specifies the type of output received from a task. Sources: harbinger/src/harbinger/proto/v1/messages_pb2.py:3074-3163, go/proto/v1/messages.pb.go:121-125

// go/proto/v1/messages.pb.go
enum OutputType {
	OUTPUT_TYPE_UNSPECIFIED = 0;
	OUTPUT_TYPE_STDOUT = 1;
	OUTPUT_TYPE_STDERR = 2;
}

Sources: go/proto/v1/messages.pb.go:121-125
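For code that handles these values outside the generated Protobuf modules, the enum can be mirrored as a plain Python IntEnum. This is a sketch only: the class below is a stand-in rather than the generated type, and the output_label helper with its fallback to "unspecified" for unknown numbers is an assumption.

```python
from enum import IntEnum


class OutputType(IntEnum):
    """Stand-in mirroring the numeric values of the OutputType enum."""

    UNSPECIFIED = 0
    STDOUT = 1
    STDERR = 2


def output_label(value: int) -> str:
    """Map a wire value to a lowercase label, tolerating unknown numbers."""
    try:
        return OutputType(value).name.lower()
    except ValueError:
        # Assumption: treat values this code does not know as unspecified.
        return OutputType.UNSPECIFIED.name.lower()
```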

Process and ShareFile Messages

These messages define the structure for process information and shared file details, respectively, which can be part of a TaskOutputRequest. Sources: harbinger/src/harbinger/proto/v1/messages_pb2.py:1453-1773, go/proto/v1/messages.pb.go:27-40

classDiagram
    direction TD
    TaskOutputRequest --> Process : contains
    TaskOutputRequest --> FileList : contains
    FileList --> ShareFile : contains
    Process : +process_id: int32
    Process : +architecture: string
    Process : +name: string
    Process : +user: string
    Process : +bin_path: string
    Process : +parent_process_id: int32
    ShareFile : +type: string
    ShareFile : +size: int64
    ShareFile : +last_accessed: string
    ShareFile : +last_modified: string
    ShareFile : +created: string
    ShareFile : +unc_path: string
    ShareFile : +name: string

Sources: harbinger/src/harbinger/proto/v1/messages_pb2.py:1453-1773, go/proto/v1/messages.pb.go:27-40
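Since each Process carries both process_id and parent_process_id, a consumer can rebuild the parent/child structure from a flat list. The sketch below uses a dataclass stand-in with a subset of the fields; children_by_parent is a hypothetical helper, not part of Harbinger.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Process:
    """Stand-in mirroring a subset of the Process message fields."""

    process_id: int
    parent_process_id: int
    name: str
    user: str = ""


def children_by_parent(processes: list[Process]) -> dict[int, list[str]]:
    """Group process names under their parent's process ID."""
    tree: dict[int, list[str]] = defaultdict(list)
    for proc in processes:
        tree[proc.parent_process_id].append(proc.name)
    return dict(tree)
```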

Database Schemas for Parsed Results

Parsed data is persisted in the database using specific schemas.

ParseResult Schema

The ParseResult schema captures the outcome of a parsing operation, linking it to the source file or task output. Sources: harbinger/src/harbinger/database/schemas.py:270-279, harbinger/interface/src/models.ts:153-162

| Field Name | Type | Description | Source |
| --- | --- | --- | --- |
| file_id | string | ID of the file that was parsed. | harbinger/interface/src/models.ts:154 |
| c2_task_id | string | ID of the C2 task. | harbinger/interface/src/models.ts:155 |
| c2_task_output_id | string | ID of the C2 task output. | harbinger/interface/src/models.ts:156 |
| proxy_job_output_id | string | ID of the proxy job output. | harbinger/interface/src/models.ts:157 |
| proxy_job_id | string | ID of the proxy job. | harbinger/interface/src/models.ts:158 |
| parser | string | Name of the parser used. | harbinger/interface/src/models.ts:159 |
| log | string | Log or summary of the parsing process. | harbinger/interface/src/models.ts:160 |
| id | string | Unique ID for the parse result. | harbinger/interface/src/models.ts:161 |
| time_created | string | Timestamp of creation. | harbinger/interface/src/models.ts:162 |
| labels | Array<string> | Labels associated with the result. | harbinger/interface/src/models.ts:163 |

Highlight Schema

The Highlight schema stores specific findings or interesting strings identified during parsing. Sources: harbinger/src/harbinger/database/schemas.py:260-269, harbinger/interface/src/models.ts:139-152

| Field Name | Type | Description | Source |
| --- | --- | --- | --- |
| file_id | string | ID of the file where the highlight was found. | harbinger/interface/src/models.ts:140 |
| c2_task_id | string | ID of the C2 task. | harbinger/interface/src/models.ts:141 |
| c2_task_output_id | string | ID of the C2 task output. | harbinger/interface/src/models.ts:142 |
| proxy_job_output_id | string | ID of the proxy job output. | harbinger/interface/src/models.ts:143 |
| proxy_job_id | string | ID of the proxy job. | harbinger/interface/src/models.ts:144 |
| parse_result_id | string | ID of the parse result this highlight belongs to. | harbinger/interface/src/models.ts:145 |
| rule_id | number | ID of the rule that triggered the highlight. | harbinger/interface/src/models.ts:146 |
| rule_type | string | Type of the rule. | harbinger/interface/src/models.ts:147 |
| hit | string | The actual string that was highlighted. | harbinger/interface/src/models.ts:148 |
| start | string | Start position of the highlight. | harbinger/interface/src/models.ts:149 |
| end | string | End position of the highlight. | harbinger/interface/src/models.ts:150 |
| line | number | Line number where the highlight was found. | harbinger/interface/src/models.ts:151 |
| id | string | Unique ID for the highlight. | harbinger/interface/src/models.ts:152 |
| time_created | string | Timestamp of creation. | harbinger/interface/src/models.ts:152 |
| labels | Array<string> | Labels associated with the highlight. | harbinger/interface/src/models.ts:152 |
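To illustrate how the hit, start, end, and line fields relate to a rule match, the sketch below applies regex rules line by line. HighlightSketch and find_highlights are hypothetical; treating rules as regular expressions, using integer offsets, and measuring offsets from the start of the line (not the whole text) are all assumptions.

```python
import re
from dataclasses import dataclass


@dataclass
class HighlightSketch:
    """Stand-in with a subset of the Highlight fields (integer offsets assumed)."""

    rule_id: int
    hit: str
    start: int
    end: int
    line: int


def find_highlights(text: str, rules: dict[int, str]) -> list[HighlightSketch]:
    """Apply each regex rule line by line and record where it matched."""
    results = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for rule_id, pattern in rules.items():
            for m in re.finditer(pattern, line):
                results.append(
                    HighlightSketch(rule_id, m.group(0), m.start(), m.end(), line_no)
                )
    return results
```

With offsets stored per line, a viewer can mark the exact span of a hit without re-running the rule.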

Relationships between Parsing Schemas

The ParseResult and Highlight schemas are designed to link back to the source data (files, C2 tasks, proxy jobs) and to each other, forming a comprehensive record of extracted intelligence.

erDiagram
    File ||--o{ ParseResult : "has"
    C2_Task ||--o{ ParseResult : "has"
    C2_Task_Output ||--o{ ParseResult : "has"
    Proxy_Job ||--o{ ParseResult : "has"
    Proxy_Job_Output ||--o{ ParseResult : "has"
    ParseResult ||--o{ Highlight : "generates"
    File {
        string id
        string filename
        string path
    }
    C2_Task {
        string id
    }
    C2_Task_Output {
        string id
    }
    Proxy_Job {
        string id
    }
    Proxy_Job_Output {
        string id
    }
    ParseResult {
        string id
        string file_id
        string c2_task_id
        string c2_task_output_id
        string proxy_job_output_id
        string proxy_job_id
        string parser
        string log
        string time_created
        string[] labels
    }
    Highlight {
        string id
        string file_id
        string c2_task_id
        string c2_task_output_id
        string proxy_job_output_id
        string proxy_job_id
        string parse_result_id
        int rule_id
        string rule_type
        string hit
        string start
        string end
        int line
        string time_created
        string[] labels
    }

Sources: harbinger/src/harbinger/database/schemas.py:270-279, harbinger/src/harbinger/database/schemas.py:260-269, harbinger/interface/src/models.ts:153-162, harbinger/interface/src/models.ts:139-152

Conclusion

File and output parsing in Harbinger provides a structured approach to extracting actionable intelligence from various data sources. By defining clear interfaces for parsers and integrating them into robust workflows, the system ensures that raw C2 outputs and uploaded files are transformed into meaningful data points, enriching the overall operational picture and supporting further analysis and automation.
