overleaf/services/web/scripts/delete-orphaned-docs
Timothée Alby 6734d9e107 Merge pull request #2641 from overleaf/ew-delete-orphaned-docs
Process and Script to Delete Orphaned Docs

GitOrigin-RevId: 0c428e36973a9131e3d379fdf4657e802576b23d
2020-03-10 04:14:46 +00:00

Delete Orphaned Docs

Because of the large number of documents and projects, orphaned docs have to be detected using bulk exports of the raw data.

Exporting Data Files

Follow the directions in google-ops/README.md for exporting data from mongo and copying the files to your local machine.

Exporting docs

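Run the following doc export command to export all doc ids and their associated project ids in batches of 10,000,000. The command shown exports the first batch; for each further batch, increase --skip by 10,000,000 and change the output file name so that earlier batches are not overwritten.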

mongoexport --uri $READ_ONLY_MONGO_CONNECTION_STRING --collection docs --fields '_id,project_id' --skip 0 --limit 10000000 --type=csv --out docs.00000000.csv
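The batched export can be scripted rather than typed by hand. The following is a minimal sketch, not part of this repository: the helper name buildDocExportCommands is hypothetical, and it assumes each output file is named after its zero-padded --skip value, as the docs.00000000.csv name of the first batch suggests.

```javascript
// Sketch only: builds one mongoexport command string per batch.
// Assumption: output files are named after the zero-padded --skip value.
function buildDocExportCommands(batchCount, batchSize = 10000000) {
  const commands = []
  for (let batch = 0; batch < batchCount; batch++) {
    const skip = batch * batchSize
    const outFile = `docs.${String(skip).padStart(8, '0')}.csv`
    commands.push(
      'mongoexport --uri $READ_ONLY_MONGO_CONNECTION_STRING --collection docs ' +
        `--fields '_id,project_id' --skip ${skip} --limit ${batchSize} ` +
        `--type=csv --out ${outFile}`
    )
  }
  return commands
}
```

Printing the result of buildDocExportCommands(3) gives three commands that can be reviewed and then pasted into a shell. The number of batches needed depends on the size of the docs collection.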

This will produce files like:

_id,project_id
ObjectId(5babb6f864c952737a9a4c32),ObjectId(5b98bba5e2f38b7c88f6a625)
ObjectId(4eecaffcbffa66588e000007),ObjectId(4eecaffcbffa66588e00000d)

Concatenate these into a single file: cat docs.*csv > all-docs-doc_id-project_id.csv

For object ids the script will accept either plain hex strings or the ObjectId(...) format used by mongoexport.
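As an illustration of accepting both formats, an id can be reduced to its 24-character hex string before use. This is a sketch under assumptions, not the code in delete-orphaned-docs.js: the name normalizeObjectId and the choice to return null for unrecognised input (such as a CSV header line) are both hypothetical.

```javascript
// Hypothetical helper, not taken from delete-orphaned-docs.js.
// Matches either ObjectId(<24 hex chars>) or a bare 24-character hex string.
const OBJECT_ID_PATTERN = /^(?:ObjectId\(([0-9a-fA-F]{24})\)|([0-9a-fA-F]{24}))$/

function normalizeObjectId(value) {
  const match = OBJECT_ID_PATTERN.exec(value.trim())
  if (!match) return null // e.g. a header such as "_id" or a malformed id
  return (match[1] || match[2]).toLowerCase()
}
```

With this sketch, both ObjectId(5babb6f864c952737a9a4c32) and 5babb6f864c952737a9a4c32 normalise to the same hex string.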

Exporting Projects

Export project ids from both the projects and deletedProjects collections:

mongoexport --uri $READ_ONLY_MONGO_CONNECTION_STRING --collection projects --fields '_id' --type=csv --out projects.csv
mongoexport --uri $READ_ONLY_MONGO_CONNECTION_STRING --collection deletedProjects --fields 'project._id' --type=csv --out deleted-projects.csv

Concatenate these: cat projects.csv deleted-projects.csv > all-projects-project_id.csv

Processing Exported Data

Create a unique sorted list of project ids from docs

cut -d, -f 2 all-docs-doc_id-project_id.csv | sort | uniq > all-docs-project_ids.sorted.uniq.csv

Create a unique sorted list of project ids from projects

sort all-projects-project_id.csv | uniq > all-projects-project_id.sorted.uniq.csv

Create a list of project ids in docs but not in projects

comm --check-order -23 all-docs-project_ids.sorted.uniq.csv all-projects-project_id.sorted.uniq.csv > orphaned-doc-project_ids.csv
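With -23, comm prints only the lines unique to the first file, so the step above is a set difference. For small samples, the same operation can be sketched in JavaScript to sanity-check the result. The function name idsOnlyInFirst is hypothetical, and this in-memory version is not meant for the full exports.

```javascript
// Sketch of the set difference that `comm -23` computes on two sorted files:
// ids that appear in the first list but not in the second.
function idsOnlyInFirst(docProjectIds, knownProjectIds) {
  const known = new Set(knownProjectIds)
  const uniqueDocProjectIds = [...new Set(docProjectIds)]
  return uniqueDocProjectIds.filter(id => !known.has(id)).sort()
}
```

Note that Array.prototype.sort orders strings by UTF-16 code units, which may differ from the locale-dependent order used by sort and comm.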

Create a list of doc ids with project ids not in projects

grep -F -f orphaned-doc-project_ids.csv all-docs-doc_id-project_id.csv > orphaned-doc-doc_id-project_id.csv
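grep -F -f keeps every line that contains any of the orphaned project ids as a fixed string, anywhere on the line. A stricter variant, sketched here with the hypothetical name selectOrphanedRows, compares only the second CSV column against the orphaned ids. It assumes rows in the doc_id,project_id layout shown earlier.

```javascript
// Sketch: keep rows whose project id (second column) is in the orphaned set.
// Unlike `grep -F -f`, this matches the whole column rather than a substring.
function selectOrphanedRows(rows, orphanedProjectIds) {
  const orphaned = new Set(orphanedProjectIds)
  return rows.filter(row => {
    const [, projectId] = row.split(',')
    return orphaned.has(projectId)
  })
}
```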

Run doc deleter

node delete-orphaned-docs orphaned-doc-doc_id-project_id.csv

Commit Changes

By default, the script only prints the list of project ids and doc ids that would be deleted. To actually delete the docs, run it with the --commit argument.
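The dry-run-by-default behaviour can be pictured with the sketch below. It is illustrative only and does not reproduce delete-orphaned-docs.js: the names reportOrDelete, deleteProjectDocs and log are hypothetical, and the real script's output format and deletion calls may differ.

```javascript
// Sketch of a dry-run default: always report, delete only when commit is true.
// docsByProject is a Map of project id -> array of doc ids (an assumption).
function reportOrDelete(docsByProject, { commit = false, deleteProjectDocs, log = console.log } = {}) {
  let deletedProjects = 0
  for (const [projectId, docIds] of docsByProject) {
    const verb = commit ? 'Deleting' : 'Would delete'
    log(`${verb} docs for project ${projectId}: ${docIds.join(', ')}`)
    if (commit) {
      deleteProjectDocs(projectId) // hypothetical callback that performs the deletion
      deletedProjects++
    }
  }
  return deletedProjects
}
```

Keeping the destructive call behind a single flag means a run without --commit can be reviewed safely before anything is removed.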

Selecting Input Lines to Process

The --limit and --offset arguments select which input lines to process. The input has one doc per line, so a single project will often span multiple lines. Deletion is based on project id, however: once any line for a project is processed, all docs for that project are deleted, even if the project's other lines fall outside the selected range.
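The interaction between line selection and project-based deletion can be illustrated as follows. This is a sketch under assumptions rather than the script's own logic: the function names are hypothetical, and it assumes --offset is a zero-based count of lines to skip and --limit a maximum number of lines to read.

```javascript
// Sketch: pick a window of input lines, then collect the distinct project ids
// in that window. Every doc of each collected project would be deleted.
function selectLines(lines, { offset = 0, limit = Infinity } = {}) {
  return lines.slice(offset, offset + limit)
}

function projectIdsToDelete(lines, options) {
  const projectIds = new Set()
  for (const line of selectLines(lines, options)) {
    const [, projectId] = line.split(',')
    if (projectId) projectIds.add(projectId)
  }
  return [...projectIds]
}
```

For example, given the lines docA,p1 then docB,p1 then docC,p2, a limit of 1 selects only the first line but still yields project p1, so docB would be removed as well even though its line was never read.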