A more accessible APS Gazette
May 26, 2013
The APS Gazette
Since 2010 I’ve been interested in analysing the data contained in the APS Gazette. Unfortunately, despite the regular format of the gazette notices, their published format does not lend itself to aggregate analysis in any form.
There is also the APS Statistical Bulletin, which provides aggregated statistics on an annual (soon to be six-monthly) basis.
In contrast, the APS Gazette publishes individual-level detail of all job opportunities, engagements, promotions, transfers and terminations/separations every week.
The type of questions that could be answered using this detailed data include (thanks to Adam Sheppard for suggesting the last two):
- What job types are in demand?
- What proportion of job opportunities are not filled?
- How fast is the fastest rise through the APS levels?
- Which agencies promote the fastest?
- Are there more promotions internal to an agency than external (that is, do agencies have a bias for promoting internals, or bringing new people in who were a level below at their previous agency)?
- What is the average time between promotions that occur within an agency vs promotions where a person’s last movement (promotion or at level) was from another agency?
So with some of those questions in mind, I set out to convert the APS Gazette into a usable format.
August 2007 to May 2013
There was a format change to the Gazette in August 2007 (PS 31 2007) to its current format. I have focussed on the current format, and that has resulted in detailed data on approximately:
- 76,662 Employment opportunities
- 72,434 Engagements
- 65,692 Promotions
- 16,232 Movements
- 11,885 Retirements / terminations (separations)
From the beginning
The first part of the process was collecting all of the gazettes. Neither the APSJobs website nor the [NLA Pandora Archive](http://pandora.nla.gov.au/tep/75984) has the full set easily accessible, although Pandora is close. If you’re missing one, you can ask me, or the NLA or APSC can probably provide it too. I chose to use the PDF format Gazettes, but you could use a similar process to parse the DOC format too.
Unfortunately each PDF appeared to be named somewhat haphazardly. I settled on a YYYY-PS-NN format where YYYY is the year and NN is the gazette number.
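As a rough sketch of that renaming step, assuming each gazette's issue label can be read in the "PS NN YYYY" form (for example "PS 31 2007"), a small helper can produce the canonical name. The function name `canonical_name`, the regular expression and the zero-padding are illustrative assumptions, not the code I actually used.

```python
import re

# Matches labels like "PS 31 2007" or "ps5 2010" (assumed label form).
LABEL_RE = re.compile(r"PS\s*(\d{1,2})\s+(\d{4})", re.IGNORECASE)


def canonical_name(label, ext=".pdf"):
    """Turn an issue label such as 'PS 31 2007' into '2007-PS-31.pdf'."""
    match = LABEL_RE.search(label)
    if match is None:
        raise ValueError(f"no gazette label found in {label!r}")
    number, year = int(match.group(1)), int(match.group(2))
    # Zero-pad the issue number so names sort in order within a year.
    return f"{year}-PS-{number:02d}{ext}"
```

Zero-padding the gazette number means a plain alphabetical sort of the files is also a chronological sort.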
The next step was to convert the PDFs into text. I tried a number of options, including tesseract and OCRopus, but in the end realised that I didn’t need to perform any OCR, so I settled on pdf2txt from PDFMiner, which output pretty clean plain text files from the PDFs.
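A batch conversion along these lines might look like the sketch below. It assumes PDFMiner's `pdf2txt.py` script is installed and on the PATH, and uses its `-o` option to name the output file. The helper names `pdf2txt_command` and `convert_all` and the directory layout are hypothetical.

```python
import subprocess
from pathlib import Path


def pdf2txt_command(pdf_path, out_dir):
    """Build the pdf2txt.py command that writes <out_dir>/<name>.txt."""
    pdf_path = Path(pdf_path)
    out_path = Path(out_dir) / (pdf_path.stem + ".txt")
    # -o names the output file; pdf2txt.py is assumed to be on the PATH.
    return ["pdf2txt.py", "-o", str(out_path), str(pdf_path)]


def convert_all(pdf_dir, out_dir):
    """Convert every PDF in pdf_dir, producing one text file per gazette."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        # check=True stops the batch if any single conversion fails.
        subprocess.run(pdf2txt_command(pdf_path, out_dir), check=True)
```

Calling `convert_all("pdfs", "text")` would then leave a `text/2007-PS-31.txt` next to each `pdfs/2007-PS-31.pdf`, and so on for the rest of the set.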
The next step was the slowest - building a semantic parsing engine to read each gazette and extract just the relevant parts.
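To give a flavour of what such a parser has to do, here is a heavily simplified sketch that splits one gazette's text into typed notices. It is not my actual parsing engine: the section headings in `SECTION_HEADINGS` are placeholders rather than the Gazette's real wording, and the assumption that notices are separated by blank lines is just that, an assumption.

```python
from collections import Counter

# Hypothetical section headings mapped to notice types; the real Gazette
# wording may differ and would need checking against the text files.
SECTION_HEADINGS = {
    "vacancies": "opportunity",
    "engagements": "engagement",
    "promotions": "promotion",
    "movements": "movement",
    "separations": "separation",
}


def split_notices(text):
    """Yield (notice_type, notice_text) pairs from one gazette's plain text.

    Assumes a heading line switches the current section and that notices
    within a section are separated by blank lines.
    """
    current_type = None
    buffer = []
    for line in text.splitlines():
        stripped = line.strip()
        heading = SECTION_HEADINGS.get(stripped.lower())
        if heading is not None or not stripped:
            # A heading or blank line closes the notice being collected.
            if buffer and current_type is not None:
                yield current_type, "\n".join(buffer)
            buffer = []
            if heading is not None:
                current_type = heading
            continue
        buffer.append(stripped)
    # Emit the final notice if the text did not end with a blank line.
    if buffer and current_type is not None:
        yield current_type, "\n".join(buffer)


def count_notices(text):
    """Count notices of each type in one gazette."""
    return Counter(notice_type for notice_type, _ in split_notices(text))
```

Text that appears before the first recognised heading is dropped, which is a simple way to skip covers and preamble. Summing `count_notices` over every gazette is one way to arrive at per-type totals.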
At the moment I have only started on the high level analysis. For example, there are some interesting cyclic features to each of the series:
[Chart: Movements / transfers]
In the next few days I’m hoping to dig into the detail more. For example, I need to parse each notice type to extract features related specifically to the notice.
Depending on how far I get, I may try to release the parsing tools I’m using and put together an entry for GovHack.
For now, all I can offer for download is the summary statistics, but watch this space for more to come…