Customizable script to generate 3D embeddings, topic labels, and a neighbor matrix using Lever for Change proposal data.
For use with non-LfC data, implement your own cleaning function.
pip install -r requirements.txt
If problems arise, consult: https://stackoverflow.com/questions/64261546/how-to-solve-error-microsoft-visual-c-14-0-or-greater-is-required-when-inst
- The NLP tasks require some external datasets to be downloaded to the local machine.
- Run the following command in your terminal:
python -m textblob.download_corpora
If you encounter issues, consult: https://stackoverflow.com/questions/41310885/error-downloading-textblob-certificate-verify-failed
- Then, open a Python instance and run:
import nltk
nltk.download('stopwords')
nltk.download('omw-1.4')
If you have problems, consult: https://stackoverflow.com/questions/38916452/nltk-download-ssl-certificate-verify-failed
- If problems persist, you may need to download the NLTK corpora "à la carte". Follow the Python shell commands that nltk prints in red.
args.json contains keyword arguments for each step in the pipeline. Stages can be added or dropped by setting the run boolean for each stage. For example, if a cleaned and preprocessed dataset already exists, you may want to set all run flags to false except for apply_umap and topic_model. I recommend copying this file and naming it args.local.json, which is ignored by git. This way, the original config is preserved.
- Run the pipeline:
python pipeline.py
Output files will be placed under data/.
- To upload the output files to S3, ensure the s3_uploader run flag is set to true. The upload is required for the hosted web app to run with updated data; don't forget to cycle the container afterwards. You'll need to run aws configure in your terminal first to set up the AWS environment locally. Ensure that AWS_ACCESS_KEY and AWS_ACCESS_SECRET are up to date in the container's environment variables.
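The flag toggling described above can be sketched as follows. This is only an illustration: it assumes, hypothetically, that args.json is a JSON object whose top-level keys are stage names and that each stage holds a dictionary with a run boolean. The stage names cleaner and preprocessor, and the placement of document_col and components, are made up for the example; check the real args.json before relying on this layout.

```python
import json


def enable_only(config, stages_to_run):
    """Return a copy of config with run=True only for the named stages.

    Assumes (hypothetically) that each stage is a top-level key whose value
    is a dict containing a "run" boolean. Other entries are left untouched.
    """
    updated = {}
    for name, value in config.items():
        if isinstance(value, dict) and "run" in value:
            # Copy the stage settings and overwrite only the run flag.
            updated[name] = {**value, "run": name in stages_to_run}
        else:
            updated[name] = value
    return updated


# Illustrative config only; the real args.json may be laid out differently.
example = {
    "cleaner": {"run": True, "document_col": "Executive Summary"},
    "preprocessor": {"run": True},
    "apply_umap": {"run": False, "components": 3},
    "topic_model": {"run": False},
    "s3_uploader": {"run": True},
}

local = enable_only(example, {"apply_umap", "topic_model"})
print(json.dumps(local, indent=2))
```

The printed JSON could then be saved as args.local.json so that the original args.json stays unchanged.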
A small function to convert the KNN indices matrix to a CSV can be found in knnToCsv.py. This is useful for recalculating similar proposals for ingestion into Global View.
The downloaded Torque file must include the following fields if you are using the pipeline to regenerate embeddings for the landscape app. For generic use, implement your own cleaning step and adjust the document_col parameter.
"Primary Subject Area",
"Total Projected Costs",
"Priority Populations",
"Future Work Locations",
"Annual Operating Budget",
"Number of Employees",
"Organization Name",
"Organization Location",
"Executive Summary"
It is also recommended to exclude 100Change2017 from the downloaded proposals.
- input_file_name: The name of the input CSV file to run the pipeline on.
- output_file_name: The name of the cleaned CSV file output by the cleaning step.
- model_tag: A prefix that will be prepended to each output file name, useful for telling apart the outputs of multiple pipeline runs.
The cleaner.py module cleans the data downloaded from Torque.
- document_col: Specifies which field becomes the Document column in the dataframe. TF-IDF vectorization and the subsequent dimension reduction are performed on this column.
The preprocessor.py module transforms the Document column of the data into a TF-IDF vector matrix.
See more descriptions of argument values here
Currently Document is hardcoded to be the only feature evaluated in the UMAP model, but this can easily be extended to include other text, categorical, or numeric fields (see the scikit-learn docs).
I've found that the Executive Summary provides the best output, with the remaining fields useful for filtering and grouping the output.
The apply-umap.py module fits the preprocessed data to a UMAP model and produces a 2D matrix of KNN indices.
- metric: Used only for the UMAP step; the neighbor-finding step always uses euclidean.
- components: The number of components to reduce to.
- min_dist: The minimum distance between two points; UMAP disregards distances below this value during dimension reduction.
- densmap: Turning this on leads to clumpier data.
- n_neighbors: The number of neighbors to consider during UMAP computation.
- threshold: The euclidean distance below which another point is considered a neighbor. Higher values result in more neighbors per point.
- neighbor_approach: Whether to find neighbors for each proposal based on the N nearest neighbors or on any neighbor within a radius. Options are radius or knn. If radius, the threshold value is used; if knn, the n_neighbors value is used. radius is recommended for building an embedding for the proposal landscape app, while knn is better suited for identifying similar proposals in Global View.
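To make the difference between the two neighbor_approach options concrete, here is a minimal pure-Python sketch. It is not the code in apply-umap.py: the function name find_neighbors is hypothetical, the brute-force search is chosen only for readability, and the sketch assumes a point is not counted as its own neighbor, which may differ from the pipeline's behavior.

```python
import math


def find_neighbors(points, approach="radius", threshold=1.0, n_neighbors=2):
    """For each point, return the indices of its neighbors (euclidean).

    approach="radius": every other point closer than threshold.
    approach="knn": the n_neighbors closest other points.
    """
    neighbors = []
    for i, p in enumerate(points):
        # Distance from point i to every other point, nearest first.
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        if approach == "radius":
            neighbors.append([j for d, j in dists if d < threshold])
        elif approach == "knn":
            neighbors.append([j for _, j in dists[:n_neighbors]])
        else:
            raise ValueError("approach must be 'radius' or 'knn'")
    return neighbors


# Made-up 3D embedding: three nearby points and one distant point.
embedding = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (0.0, 0.6, 0.0), (5.0, 5.0, 5.0)]

by_radius = find_neighbors(embedding, approach="radius", threshold=1.0)
by_knn = find_neighbors(embedding, approach="knn", n_neighbors=2)
print(by_radius)
print(by_knn)
```

In this toy data the distant point gets an empty list under radius but still receives two neighbors under knn, which illustrates why knn always yields candidates for "similar proposals" while radius can leave isolated proposals without any.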
Learn more in the UMAP documentation and the KNN documentation
The topic-model.py module takes the dimension-reduced data from UMAP and finds clusters in the data space. It then generates a list of labeled clusters (by most important words) and their associated cluster centers. The module uses HDBSCAN, which allows for data points to be assigned -1 as a cluster - in other words not belonging to any cluster. For more information on the topic modelling approach, see here. See the HDBSCAN docs for descriptions of the arguments.
The s3_uploader.py module puts every file found in data/ into an S3 bucket. You'll need to create your own AWS account, S3 bucket, and AWS access key.
- bucket: The name of the bucket housing your model data.
- acl: The default ACL for each uploaded file. private is recommended so that objects can only be accessed with access keys.
A pandas DataFrame of cleaned proposal data. It is stored in consts.py as df in the web app repository and is output as a CSV with the name defined in output_file_name.
Note that the Application # column is not the index of the proposal in the loaded dataframe; the index is a separate value generated internally by pandas.
The topic model module adds two columns to the cleaned dataset: Topic and Outlier Score. Topic is a numeric variable corresponding to the index of the topic specified in topics.pkl. Outlier Score is a numeric variable denoting how closely the observation is associated with its assigned topic. Higher values mean that the observation is more loosely associated with its cluster and is therefore more of a unique or outlier observation.
An embedding matrix with coordinates for each proposal on each dimension in [1, components]. Stored in consts.py as embeddings in the web app repository and is outputted in data/ as <MODEL_TAG>_embeddings.pkl.
With the default number of components (3), the matrix simply becomes a list of x, y, z coordinates. The web app will need minor tweaks to support 2D coordinate pairs or certain subsets of higher-dimensional data.
[
(3.2345, 1.009, 0.13677),
(6.32454, 2.083, 1.562),
...,
(4.09092, 3.0004, 0.158)
]
Stored in consts.py as knn_indices in the web app repository and is outputted in data/ as <MODEL_TAG>_knn_indices.pkl.
The element at index i of the KNN Indices Matrix corresponds to the proposal at index i in the Proposal Dataframe. Each element contains a list of indices, which represent the neighbors of that proposal. A proposal may have no neighbors (empty list).
For example:
[
[0, 10, 15],
[],
...,
[5000]
]
These indices might evaluate to something like (made up names):
[
[Health Equity in Chicago, Building a Hospital in Indiana, Advocacy for Children of Cancer Patients],
[],
...,
[Investing in Guatemalan Science Education]
]
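The lookup from indices to names shown above can be sketched in a few lines. This is an illustration only: the function name neighbor_titles is hypothetical, and the titles list stands in for a column of the Proposal Dataframe read in positional order (not by Application #).

```python
def neighbor_titles(knn_indices, titles):
    """Replace each neighbor index with the title at that position."""
    return [[titles[j] for j in row] for row in knn_indices]


# Made-up titles, ordered by their position in the dataframe.
titles = [
    "Health Equity in Chicago",
    "Building a Hospital in Indiana",
    "Advocacy for Children of Cancer Patients",
]

# Made-up KNN indices: proposal 1 has no neighbors.
knn_indices = [[0, 1, 2], [], [0]]

resolved = neighbor_titles(knn_indices, titles)
for title, neighbors in zip(titles, resolved):
    print(title, "->", neighbors)
```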
Stored in consts.py as topics in the web app repository and in data/ as <MODEL_TAG>_topics.pkl.
This is a dictionary of cluster_label -> topic_information mappings. The first key is always -1, which corresponds to the "non-cluster" entry and therefore does not have an exemplar entry. All other entries have a set of words and a 3D exemplar coordinate, denoting the coordinates of the point which represents the best center of the cluster.
{
-1: {
'words': ['social', 'economic', 'work', 'education', 'poverty']
},
0: {
'words': ['news', 'newsroom', 'medium', 'journalism', 'journalist'],
'exemplar': (7.306426048278809, -0.29259100556373596, -2.392807960510254)
}
}
The keys in the topics dictionary map to the Topic column in the Proposal Dataframe.
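As a small sketch of how a Topic value can be resolved against this dictionary, the snippet below builds a label from the top words and handles the missing exemplar of the -1 entry. The helper name describe_topic and the max_words parameter are hypothetical; the dictionary literal repeats the example above.

```python
topics = {
    -1: {
        "words": ["social", "economic", "work", "education", "poverty"],
    },
    0: {
        "words": ["news", "newsroom", "medium", "journalism", "journalist"],
        "exemplar": (7.306426048278809, -0.29259100556373596, -2.392807960510254),
    },
}


def describe_topic(topics, topic_id, max_words=3):
    """Return (label, exemplar) for a value from the Topic column.

    The exemplar is None for the -1 "non-cluster" entry, which has no
    exemplar key.
    """
    entry = topics[topic_id]
    label = ", ".join(entry["words"][:max_words])
    return label, entry.get("exemplar")


print(describe_topic(topics, 0))
print(describe_topic(topics, -1))
```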
For convenience, a utility function is provided to map proposals to a list of neighboring proposals. The CSV generated has two columns, the first containing the ID of a target proposal, and the second a comma separated list of neighboring proposals (or blank if no neighbors exist). In other words, this is a reformatted CSV version of the KNN Indices Matrix.
Example command: python knnToCsv.py LFC_proposals-clean.csv LFC
Args:
- filename: The name of the cleaned CSV generated in the cleaning step of the pipeline.
- modelTag: The model tag of the desired pipeline output. It determines which knn_indices matrix to select.
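The two-column layout described above can be sketched with the standard csv module. This is not the code in knnToCsv.py: the function name knn_to_csv_rows, the header names, and the use of a plain list of proposal IDs are all assumptions made for illustration.

```python
import csv
import io


def knn_to_csv_rows(knn_indices, proposal_ids):
    """Build (target_id, "id1,id2,...") rows from positional neighbor indices."""
    rows = []
    for i, neighbor_positions in enumerate(knn_indices):
        neighbor_ids = [str(proposal_ids[j]) for j in neighbor_positions]
        # An empty neighbor list yields a blank second column.
        rows.append((str(proposal_ids[i]), ",".join(neighbor_ids)))
    return rows


# Made-up data: positions 0..2 map to hypothetical proposal IDs.
knn_indices = [[1, 2], [], [0]]
proposal_ids = [101, 205, 309]

rows = knn_to_csv_rows(knn_indices, proposal_ids)

# Write to an in-memory buffer; a real script would open a file instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["proposal", "neighbors"])  # hypothetical header names
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Because the neighbor list itself contains commas, the csv writer quotes that column, so the file still parses as exactly two columns.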