Trio + asks + instrumentation as progress bar help needed #187

@rvencu


Hi, I am not sure this is the right place to ask, but it seems the most appropriate one.

I am running a classic mass download job with the trio and asks libraries. As expected, I call trio.run from the main thread. In the main function I open a nursery and call .start_soon for every URL, and a second function performs the actual download.

Now I want to use tqdm to monitor the progress and I am using this trio instrument:

class TrioProgress(trio.abc.Instrument):

    def __init__(self, total, notebook_mode=False, **kwargs):
        if notebook_mode:
            from tqdm.notebook import tqdm
        else:
            from tqdm import tqdm

        self.tqdm = tqdm(total=total, desc="Downloaded: [ 0 ] / Links ", **kwargs)

    def task_exited(self, task):
        if task.custom_sleep_data == 0:
            self.tqdm.update(7)
        if task.custom_sleep_data == 1:
            self.tqdm.update(7)
            self.tqdm.desc = self.tqdm.desc.split(":")[0] + ": [ " + str( int(self.tqdm.desc.split(":")[1].split(" ")[2]) + 1 ) + " ] / Links "
            self.tqdm.refresh()

Let's ignore the details and focus on the main job of the progress bar, i.e. to tick once for every processed URL. I thought the second function was the right place to set task.custom_sleep_data, which the instrument reads:

async def request_image(datas, start_sampleid):
    tmp_data = []

    import asks
    asks.init("trio")

    session = asks.Session(connections=64)
    session.headers = {
        "User-Agent": "Googlebot-Image",
        "Accept-Language": "en-US",
        "Accept-Encoding": "gzip, deflate",
        "Referer": "https://www.google.com/",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    }

    async def _request(data, sample_id):
        url, alt_text, license = data
        # Tag the current task so the instrument can inspect it on exit.
        task = trio.lowlevel.current_task()
        task.custom_sleep_data = None
        try:
            proces = process_img_content(
                await session.get(url, timeout=5, connection_timeout=40), alt_text, license, sample_id
            )
            if proces is not None:
                tmp_data.append(proces)
                # Mark a confirmed successful download.
                task.custom_sleep_data = 1
        except Exception:
            return

However, when I count the ticks, they do not equal the size of my URL list. So the progress bar does not answer the basic question: "how long until the job finishes?"

When I experimented with one tick at every exit from the second function, which seemed the intuitive way, I got about 2.5 to 3 times more ticks than expected. Depending on the actual URL list, this ratio can go up to 7, as in the example above.

I would like to understand what is happening and, if possible, find a way to correctly count finished download tasks, whether successful or not. I was able to count the successful ones correctly by confirming the actual download, but all the others are a mystery.
