Sharing massive files with collaborators

Solution for sharing large files with your group

This is a "guest lecture" from Harry Hawkes. I've modified it slightly in some places. Thanks, Harry!


Professor Bowen asked me to share my solution for sharing large files with your group or future users of the programs written for the final project.

The solution I found was to upload the (compressed) files to your Lehigh Google Drive; group members can then download them from Google Drive from within the program.

This works well if your data comes from multiple sources, or from sources that require an API, because you only need to visit those sources once. It also means no one has to keep such large files on their computer, and compressing the files speeds up the downloads.

PROF: For projects where you load the big data files only once and then shrink the data to a manageable size for analysis, it makes sense to save the analysis-ready files to your shared GitHub repo (which lives on everyone's computer). That way your analysis always starts from the same point, and everyone can work on the analysis while skipping the data-construction steps.
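The "shrink once, commit the small file" pattern above can be sketched as follows. This is a minimal illustration, not part of Harry's code; the file name `analysis_sample.csv` and the toy data are made up.

```python
import pandas as pd

# Imagine this DataFrame came from a multi-GB raw file you only
# want to load once (the data here is a made-up placeholder):
big = pd.DataFrame(
    {"firm": ["AAPL", "AAPL", "MSFT", "MSFT"],
     "year": [2019, 2020, 2019, 2020],
     "ret":  [0.86, 0.78, 0.55, 0.41]}
)

# Shrink it to just the rows and columns the analysis needs...
small = big.query("year == 2020")[["firm", "ret"]]

# ...and save the small, analysis-ready file inside the repo,
# so everyone's analysis starts from this same point:
small.to_csv("analysis_sample.csv", index=False)
```

Everyone else can then start from `analysis_sample.csv` and never touch the raw download.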

The solution

Steps to complete the process:

  1. Upload file to Google Drive
  2. Turn on sharing for the file by right-clicking it and selecting "Get shareable link"
  3. Turn on sharing for users outside of Lehigh University; you don't need to give edit access
  4. Insert the function from this page into your program
  5. Get the ID from the sharing link (found at the end of the sharing URL) and pass it, along with the location where you want the downloaded file stored, to the function:

    download_file_from_google_drive('ID', 'LOCATION')
    
  6. If you compressed the files, you'll need to unzip them. One option:

    import zipfile
    with zipfile.ZipFile('LOCATION', 'r') as zip_ref:
        zip_ref.extractall('input')
    
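For step 5, a small helper can pull the ID out of the sharing URL for you. This function is not part of Harry's code, and the example URL and ID below are made up; it handles the two common shapes of Drive sharing links.

```python
import re
from urllib.parse import urlparse, parse_qs

def extract_drive_id(share_url):
    """Return the file ID embedded in a Google Drive sharing URL."""
    # Older-style links: https://drive.google.com/open?id=<ID>
    qs = parse_qs(urlparse(share_url).query)
    if "id" in qs:
        return qs["id"][0]
    # Newer-style links: https://drive.google.com/file/d/<ID>/view?usp=sharing
    m = re.search(r"/file/d/([^/]+)", share_url)
    return m.group(1) if m else None

# Hypothetical link; the ID '1AbC_dEf' is made up:
print(extract_drive_id("https://drive.google.com/file/d/1AbC_dEf/view?usp=sharing"))
# → 1AbC_dEf
```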

Function:

import requests

def download_file_from_google_drive(id, destination):
    def get_confirm_token(response):
        # For large files, Google Drive serves a "can't scan for viruses"
        # warning page first; the token needed to confirm the download
        # is stored in a cookie.
        for key, value in response.cookies.items():
            if key.startswith('download_warning'):
                return value

        return None

    def save_response_content(response, destination):
        CHUNK_SIZE = 32768

        with open(destination, "wb") as f:
            for chunk in response.iter_content(CHUNK_SIZE):
                if chunk: # filter out keep-alive new chunks
                    f.write(chunk)

    URL = "https://docs.google.com/uc?export=download"

    session = requests.Session()

    response = session.get(URL, params = { 'id' : id }, stream = True)
    token = get_confirm_token(response)

    if token:
        params = { 'id' : id, 'confirm' : token }
        response = session.get(URL, params = params, stream = True)

    save_response_content(response, destination)    


if __name__ == "__main__":
    import sys
    if len(sys.argv) != 3:  # 'is not' compares identity, not value; use !=
        print("Usage: python google_drive.py drive_file_id destination_file_path")
    else:
        # TAKE ID FROM SHAREABLE LINK
        file_id = sys.argv[1]
        # DESTINATION FILE ON YOUR DISK
        destination = sys.argv[2]
        download_file_from_google_drive(file_id, destination)