Collaborator

@mnjowe mnjowe commented Jan 15, 2026

This PR addresses issue #1778

Scope

  1. Establish a single helper function for scaling population DataFrames to census totals
  2. Support both flat DataFrames (single-level columns) and MultiIndex column DataFrames
  3. Handle common input shapes where:
    - date is already the index
    - date is provided as a column and must be promoted to the index
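
For illustration only (this is not the code in the PR): a minimal sketch of what a helper with this scope could look like. The name `scale_population`, its signature and the `date_col` default are assumptions, as is the premise that every non-date column holds numeric counts. Promoting a date column is only handled for flat columns here; MultiIndex frames are assumed to already have the date as their index.

```python
import pandas as pd


def scale_population(df, scale_factor, date_col="date"):
    """Return a scaled copy of `df` (hypothetical helper, not the PR's function).

    Assumes every column other than `date_col` holds numeric counts.
    """
    out = df.copy()
    # Flat columns only: promote the date column to the index if it is still a column.
    if not isinstance(out.columns, pd.MultiIndex) and date_col in out.columns:
        out = out.set_index(date_col)
    # Multiplying the whole frame works for both flat and MultiIndex columns.
    return out * scale_factor


# Usage: flat frame with the date supplied as a column.
flat = pd.DataFrame(
    {"date": pd.to_datetime(["2010-01-01", "2011-01-01"]), "total": [100, 110]}
)
scaled_flat = scale_population(flat, 10.0)

# Usage: MultiIndex columns with the date already as the index.
multi = pd.DataFrame(
    [[60, 40], [66, 44]],
    index=pd.to_datetime(["2010-01-01", "2011-01-01"]),
    columns=pd.MultiIndex.from_tuples([("F", "0-4"), ("M", "0-4")]),
)
scaled_multi = scale_population(multi, 2.0)
```

The input frame is copied, so the caller's DataFrame is left unchanged.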

@mnjowe mnjowe self-assigned this Jan 15, 2026
@mnjowe mnjowe linked an issue Jan 15, 2026 that may be closed by this pull request
# =========================================================
# Positive: Flat columns (date already index)
# =========================================================
census_pop = 1_500_000
Collaborator

@marghe-molaro marghe-molaro Jan 21, 2026

Hi @mnjowe, I've just started reviewing this but I have a question!
I thought I remembered from one of the quarterly meeting discussions that we wanted to make the scaling census-year dependent: if the simulation starts in 2010 with a sim_pop_size_2010, but the nearest available census is in e.g. 2014, the scaling factor would then be sim_pop_size_2014 (which would just be the number of alive individuals in the sim in 2014) over the 2014 census pop. Is that the idea here, with the user having to manually compute the simulated pop size at the time of the census?
Or is this PR not related to your generalisation of the demography module to Tanzania?

Collaborator Author

That's absolutely right @marghe-molaro.

  1. This PR is related to my generalisation of the demography module to Tanzania
  2. scaling should indeed be census year dependent as you have explained above
  3. sim_pop_size_census_year should come from the population dataframe indexing the census year
  4. scaling factor should then be sim_pop_size_census_year/census_pop
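
A rough sketch of points 3 and 4, under assumptions: the function name `census_year_ratio` is hypothetical, and the simulated population is taken to be a pandas Series of total counts with a DatetimeIndex (the real logged DataFrame may be shaped differently).

```python
import pandas as pd


def census_year_ratio(population, census_year, census_pop):
    """Ratio of simulated to census population size in the census year.

    `population` is assumed to be a Series of simulated totals with a DatetimeIndex.
    """
    in_census_year = population[population.index.year == census_year]
    if in_census_year.empty:
        raise ValueError(f"No simulated population logged for {census_year}")
    # Use the first record logged in the census year.
    sim_pop_size_census_year = in_census_year.iloc[0]
    return sim_pop_size_census_year / census_pop


# Illustrative numbers only.
sim_pop = pd.Series(
    [10_000, 12_000], index=pd.to_datetime(["2010-01-01", "2014-01-01"])
)
ratio = census_year_ratio(sim_pop, census_year=2014, census_pop=1_500_000)
```

With these illustrative numbers the ratio is 0.008, so model outputs would be multiplied by its reciprocal (about 125) to reach census scale.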

Suggestion

I can modify the auto-scaling function to receive the population dataframe, the census pop and the census year, thereby also automating the calculation of the scale factor. The only challenge is that this assumes every simulation run logs demography data, which I'm not sure is always the case.

Collaborator

@marghe-molaro marghe-molaro Jan 21, 2026

Thank you very much for clarifying @mnjowe!
I think, given that this function would only ever be used in post-processing, it would be good for this scaling factor to be computed 'behind the scenes', so that the user never has to worry about what input to pass to the function in order to calculate it. This is how it is currently done in the util function extract_results: if the user opts for do_scaling when extracting results, the scaling_factor is just looked up in the log.
So I think ideally we would retain this logic, but update how the scaling factor is computed - i.e. demography should schedule an event to log the scaling factor in the year of the census, based on the simulated pop size in that year.

Collaborator Author

Many thanks @marghe-molaro. I like your approach.
I will look further into the extract_results function to see how I can go about updating the auto-scaling function.

Collaborator Author

I think then you can pause your review for now until I push the updated version

Collaborator

tbhallett commented Jan 21, 2026

Hi @mnjowe and @marghe-molaro
Yes, I agree with all of that (getting scale_factor from "behind the scenes"). Tip: as well as the demography log, it's also logged in the 'population' log, which is ALWAYS on! That's where we get it from in extract_results.
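
As a sketch of that lookup order (not the actual extract_results code): the nested-dict layout of `logs`, the function name `get_scaling_factor` and the module strings `tlo.methods.population` / `tlo.methods.demography` are all assumptions made for illustration.

```python
def get_scaling_factor(
    logs, preferred=("tlo.methods.population", "tlo.methods.demography")
):
    """Return the first logged scaling factor found, trying modules in order.

    `logs` is a stand-in for parsed log output: module name -> key -> list of records.
    """
    for module in preferred:
        records = logs.get(module, {}).get("scaling_factor", [])
        if records:
            return records[0]["scaling_factor"]
    raise KeyError("No 'scaling_factor' entry found in the supplied logs")


# Illustrative logs: only the always-on population log is present.
logs = {"tlo.methods.population": {"scaling_factor": [{"scaling_factor": 125.0}]}}
factor = get_scaling_factor(logs)
```

Trying the always-on log first means the lookup does not depend on the demography logger being enabled for a given run.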

@mnjowe --- please could you also just describe again the "use case" for this function? (What it does that extract_results() doesn't do).

Collaborator Author

mnjowe commented Jan 22, 2026

Hi @tbhallett . I have gone through the extract_results() function.
I think it's doing most of what I wanted in this PR. The only thing we had suggested that I feel is not included is the ability to make a scale factor that's census-year based.

With @marghe-molaro we were discussing a situation where the census was conducted in a year other than 2010. In that case, the scale factor should be obtained as the model population alive in that year over the total census population of that year.

Another thing, which is minor but I think important, is that extract_results() assumes everyone is running the analysis via Azure. I don't think this will always be the case with new researchers?

Suggestion

I guess it would be better if the existing scaling were made an independent function, called within extract_results() or any other function that scales data not run via Azure? That would also provide an opportunity for further updates to the scaling.


In any case, I think I should update the existing function rather than develop a whole new one. I didn't realise that we already have a function in place that does almost all I wanted.

@marghe-molaro
Collaborator

Hi @mnjowe,
Why do you say that extract_results is only available for Azure results? I don't think there's anything preventing it from being used on local runs?
The function to retrieve the scaling factor (get_multiplier) is defined within the extract_results function, but I don't think there's any need for it to be modified, as it is the logging of the scaling factor that should be updated.

@tbhallett
Collaborator

Yes, I think the main thing we need to focus on is how that value that is logged as the scaling factor is computed. As you both rightly pointed out, the existing code hard-wires that the 2010 population size (starting population) is compared to a reference population size (in this case, the WPP?).

def compute_initial_model_to_data_popsize_ratio(self, initial_population_size):
    """Compute ratio of initial model population size to estimated population size in 2010.

    Uses the total of the per-region estimated populations in 2010 used to
    initialise the simulation population as the baseline figure, with this value
    corresponding to the 2010 projected population from [wpp2019]_.

    .. [wpp2019] World Population Prospects 2019. United Nations Department of
        Economic and Social Affairs. URL:
        https://population.un.org/wpp/Download/Standard/Population/

    :param initial_population_size: Initial population size to calculate ratio for.
    :returns: Ratio of ``initial_population`` to 2010 baseline population.
    """
    return initial_population_size / self.parameters['pop_2010']['Count'].sum()

I think that still is OK, as long as we have a WPP value for 2010 for everywhere else (which I think we should).

But, I know that you're keen that we actually do the calibration to a census, the year of which could be ANYTHING.

So, I think the changes needed for that are:
1 - add a parameter that designates the YEAR of the census.
2 - schedule an event for that year.
3 - let that event run code analogous to the above, and log it in all the same places, letting the key denote that this is the 'census derived scale factor'.
4 - include an option in extract_results to use one or other of these two possible scaling factors.

(Steps 2 & 3 may require some light refactoring to avoid code duplication.)
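
To make the four steps concrete, here is a framework-free sketch. Everything in it is hypothetical: the function names, the `"initial"` / `"census"` keys and the plain dict standing in for the log are assumptions, and the real module would schedule a simulation event in the census year rather than read from a dict of yearly population sizes.

```python
def model_to_data_ratio(model_pop_size, data_pop_size):
    """Shared helper so the start-year and census-year cases do not duplicate code."""
    if data_pop_size <= 0:
        raise ValueError("data_pop_size must be positive")
    return model_pop_size / data_pop_size


def collect_ratios(sim_pop_by_year, start_year, start_year_data_pop,
                   census_year, census_pop):
    """Mimic the two logging points: simulation start and the census year (steps 1-3)."""
    log = {
        "initial": model_to_data_ratio(sim_pop_by_year[start_year], start_year_data_pop)
    }
    # Stand-in for the scheduled census-year event: only log if the run reached that year.
    if census_year in sim_pop_by_year:
        log["census"] = model_to_data_ratio(sim_pop_by_year[census_year], census_pop)
    return log


def select_ratio(log, use_census=False):
    """Step 4: let the caller choose which of the two ratios to scale by."""
    key = "census" if use_census else "initial"
    if key not in log:
        raise KeyError(f"No '{key}' ratio was logged")
    return log[key]


# Illustrative numbers only.
ratios = collect_ratios(
    sim_pop_by_year={2010: 10_000, 2014: 12_000},
    start_year=2010,
    start_year_data_pop=1_000_000,
    census_year=2014,
    census_pop=1_500_000,
)
census_ratio = select_ratio(ratios, use_census=True)
```

Keeping the calculation in one small helper is one way to do the light refactoring mentioned above, since both logging points then call the same code.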

Development

Successfully merging this pull request may close these issues.

Helper function for population scaling

4 participants