This repository provides a reproducible pipeline for processing, geocoding, and classifying CNPJs from the Brazilian Annual Social Information Report (RAIS) of the Brazilian Ministry of Labor and Employment (MTE) using the Locais-Nova scale.
The report is available here.
If you find this project useful, please consider giving it a star!
The processed data are available in csv, rds and parquet formats through a dedicated repository on the Open Science Framework (OSF). A metadata file is included alongside the validated datasets.
Because the raw data are not publicly available, only authorized personnel can access the processed files. They are protected with RSA 4096-bit encryption (OpenSSL) and a 32-byte password to ensure data security.
If you already have access to the OSF repository and the project keys, click here to access the data. You can also retrieve these files directly from R using the osfr package.
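As a rough sketch of the osfr route, the snippet below lists the contents of an OSF node and downloads them to a local folder. The helper name download_processed_data, the node ID "abcde", and the "data" destination folder are placeholders chosen for illustration, not the project's actual values. It assumes your OSF token is stored in the OSF_PAT environment variable, as described in the Keys section.

```r
# Hypothetical helper: download the top-level contents of an OSF node.
# `node_id` is a placeholder; replace it with the ID of the OSF
# repository you were granted access to.
download_processed_data <- function(node_id = "abcde", dest = "data") {
  # Authenticate with the token stored in the OSF_PAT environment variable.
  osfr::osf_auth(token = Sys.getenv("OSF_PAT"))

  # Create the destination folder if it does not exist yet.
  dir.create(dest, showWarnings = FALSE, recursive = TRUE)

  # Retrieve the node and list its top-level files and folders.
  # osf_ls_files() returns at most 10 entries unless n_max is raised.
  node <- osfr::osf_retrieve_node(node_id)
  files <- osfr::osf_ls_files(node, n_max = Inf)

  # Skip files that already exist locally instead of overwriting them.
  osfr::osf_download(files, path = dest, conflicts = "skip")
}
```

Calling download_processed_data("your-node-id") would then fetch the files. They remain encrypted after download, so the project keys are still needed to read them.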
The pipeline was developed using the Quarto publishing system, along with the R and AWK programming languages. To ensure consistent results, the renv package is used to manage and restore the R environment.
Access to the raw data is restricted. Running the analyses requires an active internet connection and a set of access keys (see the Keys section). Do not use VPNs, corporate proxies, or other network-routing tools while processing the data, as these can interfere with authentication and downloads.
Make sure the AWK executable directory is added to your PATH environment variable.
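One way to confirm that AWK is reachable from R is to query the PATH with Sys.which(). This is a minimal sketch: the helper name is_on_path and the list of candidate executable names are assumptions, since the executable name varies between systems.

```r
# Return TRUE when an executable with the given name is found on the PATH.
# Sys.which() returns an empty string for commands it cannot locate.
is_on_path <- function(command) {
  unname(nzchar(Sys.which(command)))
}

# Common AWK executable names; which one exists depends on the system.
awk_candidates <- c("awk", "gawk", "mawk")
awk_found <- vapply(awk_candidates, is_on_path, logical(1))

if (any(awk_found)) {
  message("AWK found: ", paste(awk_candidates[awk_found], collapse = ", "))
} else {
  message("No AWK executable found on the PATH.")
}
```

If none of the candidates is found, add the directory containing the AWK executable to PATH and restart the R session before trying again.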
After installing the four dependencies mentioned above and setting all the keys, follow these steps to reproduce the analyses:
- Clone this repository to your local machine.
- Open the project in your preferred IDE.
- Restore the R environment by running Sys.setenv(LIBARROW_MINIMAL = "false") and renv::restore() in the R console. This will install all required software dependencies.
- Open index.qmd and run the code as described in the report.
To access the data and run the Quarto notebook, you must first obtain authorization to access the project's OSF repositories and Google Sheets files.
Once you have the necessary permissions, run the following command to authorize your access to the Google Sheets API:
library(gargle)
library(googlesheets4)
options(gargle_oauth_cache = ".secrets")
gs4_auth()
gargle_oauth_cache()
Next, create a file named .Renviron in the root directory of the project and add the following environment variables:
- OSF_PAT: Your OSF Personal Access Token (PAT). If you don't have one, go to the settings section of your OSF account and create a new token.
- ACESSOSAN_PASSWORD: The password for the project's RSA private key (32 bytes).
Example (do not use these values):
OSF_PAT=bWHtQBmdeMvZXDv2R4twdNLjmakjLUZr4t72ouAbNjwycGtDzfm3gjz4ChYXwbBaBVJxJR
ACESSOSAN_PASSWORD=MmXN_od_pe*RdHgfKTaKiXdV7KD2qPzW
Additionally, you will need the following keys in the project's _ssh folder:
- id_rsa: The project's private RSA key (4096-bit RSA, OpenSSL).
- id_rsa.pub: The project's public RSA key.
The project's keys are provided to authorized personnel only. If you need access, please contact the authors.
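Before running the pipeline, it can help to verify that the credentials described above are in place. The following is a minimal sketch using base R only. The function name check_project_keys is hypothetical, and the check only confirms that the variables and files exist, not that they are valid.

```r
# Report which of the credentials described above are present.
# `root` is the project root directory.
check_project_keys <- function(root = ".") {
  # Load variables from the project's .Renviron file, if there is one.
  renviron <- file.path(root, ".Renviron")
  if (file.exists(renviron)) readRenviron(renviron)

  password <- Sys.getenv("ACESSOSAN_PASSWORD")

  c(
    osf_pat_set = nzchar(Sys.getenv("OSF_PAT")),
    password_set = nzchar(password),
    # The private key password is described as 32 bytes long.
    password_is_32_bytes = nchar(password, type = "bytes") == 32,
    private_key_found = file.exists(file.path(root, "_ssh", "id_rsa")),
    public_key_found = file.exists(file.path(root, "_ssh", "id_rsa.pub"))
  )
}

print(check_project_keys())
```

Any FALSE entry in the result points to a missing variable or key file that should be set up before opening index.qmd.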
Error in `dplyr::compute()`:
! NotImplemented: Support for codec 'zstd' not built
This error occurs when arrow is missing certain dependencies. To fix it, run:
Sys.setenv(LIBARROW_MINIMAL = "false")
Then reinstall the arrow package:
install.packages("arrow")
Important
When using this data, you must also cite the original data sources.
To cite this work, please use the following format:
Caldeira, G., Penz, C., Vartanian, D., Fernandes, C. N., & Giannotti, M. A. (2025). A reproducible pipeline for processing, geocoding, and classifying CNPJs from the Annual Social Information Report (RAIS) of the Brazilian Ministry of Labor and Employment (MTE) using the Locais-Nova scale [Computer software]. Center for Metropolitan Studies of the University of São Paulo. https://cem-usp.github.io/locais-nova-rais-geocoding
A BibLaTeX entry for LaTeX users is:
@software{caldeira2025,
title = {A reproducible pipeline for processing, geocoding, and classifying CNPJs from the Annual Social Information Report (RAIS) of the Brazilian Ministry of Labor and Employment (MTE) using the Locais-Nova scale},
author = {{Gabriel Caldeira} and {Clara Penz} and {Daniel Vartanian} and {Camila Nastari Fernandes} and {Mariana Abrantes Giannotti}},
year = {2025},
address = {São Paulo},
institution = {Center for Metropolitan Studies of the University of São Paulo},
langid = {en},
url = {https://cem-usp.github.io/locais-nova-rais-geocoding}
}
Important
The original data sources may be subject to their own licensing terms and conditions.
The code in this repository is licensed under the GNU General Public License Version 3, while the report is available under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.
Copyright (C) 2025 Center for Metropolitan Studies
The code in this report is free software: you can redistribute it and/or
modify it under the terms of the GNU General Public License as published by the
Free Software Foundation, either version 3 of the License, or (at your option)
any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with
this program. If not, see <https://www.gnu.org/licenses/>.
This work is part of a research project by the Polytechnic School (Poli) of the University of São Paulo (USP), in partnership with the Secretariat for Food and Nutrition Security (SESAN) of the Ministry of Social Development, Family, and the Fight Against Hunger (MDS), titled: AcessoSAN: Mapping Food Access to Support Public Policies on Food and Nutrition Security and Hunger Reduction in Brazilian Cities.
This work was developed with support from the Center for Metropolitan Studies (CEM) based at the School of Philosophy, Letters and Human Sciences (FFLCH) of the University of São Paulo (USP) and at the Brazilian Center for Analysis and Planning (CEBRAP).
This study was financed, in part, by the São Paulo Research Foundation (FAPESP), Brazil. Process Number 2025/17879-2.