From 5cd2409afc34c297cc2f93600e9886eec48ab8d8 Mon Sep 17 00:00:00 2001
From: Brandon Rozek
Date: Sat, 11 Apr 2020 22:10:19 -0400
Subject: [PATCH] New Post

---
 content/blog/iterativecsv.md | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)
 create mode 100644 content/blog/iterativecsv.md

diff --git a/content/blog/iterativecsv.md b/content/blog/iterativecsv.md
new file mode 100644
index 0000000..a906b98
--- /dev/null
+++ b/content/blog/iterativecsv.md
@@ -0,0 +1,31 @@
+---
+title: "Iteratively Read CSV"
+date: 2020-04-11T21:34:33-04:00
+draft: false
+tags: ["python"]
+---
+
+If you want to analyze a CSV dataset that is larger than the available RAM, you can iteratively process each observation and store or calculate only what you need. You can do this with the Python standard library as well as with the popular Pandas library.
+
+## Standard Library
+
+```python
+import csv
+
+# Stream the file one row at a time instead of loading it all into memory
+with open('/path/to/data.csv', newline='') as csvfile:
+    reader = csv.reader(csvfile, delimiter=',')
+    for row in reader:
+        for column in row:
+            do_something(column)
+```
+
+## Pandas
+
+Pandas is slightly different: you specify a `chunksize`, the number of rows per chunk, and each iteration yields a pandas DataFrame with up to that many rows.
+
+```python
+import pandas as pd
+
+chunksize = 100  # number of rows per chunk
+for chunk in pd.read_csv('/path/to/data.csv', chunksize=chunksize):
+    do_something(chunk)
+```
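+
+As a concrete example of "store or calculate only what you need", here is a minimal sketch (the file path and the `price` column name are hypothetical) that computes the mean of one column while keeping only a running total and row count in memory:
+
+```python
+import pandas as pd
+
+total = 0.0
+count = 0
+
+# Only two numbers stay in memory, no matter how large the CSV is
+for chunk in pd.read_csv('/path/to/data.csv', chunksize=100):
+    total += chunk['price'].sum()  # 'price' is a hypothetical column name
+    count += len(chunk)
+
+print('Mean price:', total / count)
+```
+
+The same pattern works with `csv.reader`: accumulate whatever summary you need inside the row loop instead of appending every row to a list.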