Saturday, January 11, 2025

PyMongoArrow: Export and Import MongoDB data to Pandas DataFrame and NumPy

If you are a data scientist or a Python developer who sometimes wears the data scientist hat, it is highly likely that at some point you were required to work with these three tools & technologies: Pandas, NumPy and MongoDB.

  • Pandas: Pandas is a fast, powerful and easy-to-use open-source data analysis and manipulation tool, built on top of the Python programming language.
  • NumPy: NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
  • MongoDB: MongoDB is an open-source document database and leading NoSQL database.

In this tutorial, we will see how to export data from a MongoDB database into a Pandas DataFrame or NumPy array, and how to import it back, using PyMongoArrow. PyMongoArrow is a tool built by MongoDB that lets you move data between MongoDB and other data formats such as Pandas DataFrames, NumPy arrays and Arrow Tables easily and efficiently.

Installation & Setup

This article assumes that you have a recent version of Python installed. Let’s install PyMongoArrow by running the following command in your terminal or PowerShell. If you already have PyMongoArrow installed, make sure it is version 0.5.0 or later so that you can use all the functionality shown in this tutorial.

python -m pip install pymongoarrow

To use PyMongoArrow with MongoDB Atlas (a fully managed MongoDB database in the cloud), we also need to install MongoDB’s pymongo driver with the srv extra.

python -m pip install 'pymongo[srv]'

If you don’t have Pandas and NumPy installed on your system, run the following command to install them:

python -m pip install pandas numpy
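
To confirm that the installed PyMongoArrow meets the 0.5.0 minimum mentioned above, a quick check along these lines can help. This is a minimal sketch that uses only the standard library; the helper names parse_version and meets_minimum are our own and are not part of PyMongoArrow, and the parsing only handles plain dotted version numbers.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(text):
    """Turn '1.2.3' into (1, 2, 3); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed, minimum="0.5.0"):
    """Return True when the installed version is at least the minimum."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '0.5' and '0.5.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


try:
    installed = version("pymongoarrow")
    status = "OK" if meets_minimum(installed) else "too old, please upgrade"
    print(installed, status)
except PackageNotFoundError:
    print("pymongoarrow is not installed")
```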

Connecting to your MongoDB Database

In this tutorial, we are assuming that you are familiar with the basics of PyMongo and MongoDB concepts. If not, head to the official documentation of PyMongo for a quick tutorial.

For simplicity, this tutorial uses MongoDB’s free Atlas cluster as the data source. If you already have an Atlas account and have created your first cluster, go to your Atlas dashboard to get the connection string. If you haven’t created an Atlas cluster yet, you can follow a few easy steps to create one, retrieve its connection string and follow along.

Now, let’s import all the necessary packages:

Python3




import pymongo
import pymongoarrow
from pymongo import MongoClient


Now, let’s create a MongoClient instance to establish a connection with our MongoDB database. If you are using an Atlas cluster, pass your cluster connection string as an argument to the MongoClient constructor.

Python3




client = MongoClient('Enter your Atlas cluster connection string here')


Note: If you are connecting to a MongoDB instance running locally on the default host and port, you can connect with client = MongoClient('localhost', 27017) and follow along.
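
Since an Atlas connection string usually contains credentials, you may prefer not to paste it into your script. Here is a minimal sketch that reads it from an environment variable and falls back to the local default. The variable name MONGODB_URI and the helper get_mongo_uri are assumptions made for this illustration and are not required by PyMongo.

```python
import os


def get_mongo_uri(env=None):
    """Return the connection string from the environment, or the local default.

    MONGODB_URI is a hypothetical variable name chosen for this example.
    """
    env = os.environ if env is None else env
    return env.get("MONGODB_URI", "mongodb://localhost:27017")


uri = get_mongo_uri()
# Print only the part after any credentials, so the password is not echoed.
print(uri.split("@")[-1])

# With PyMongo installed, the URI can then be passed to the client:
# client = MongoClient(uri)
```

The fallback URI mongodb://localhost:27017 points at the same default host and port as MongoClient('localhost', 27017).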

Let’s connect to a test database named ‘test_database’ and a test collection named ‘test_collection’.

Python3




db = client.test_database
col = db.test_collection


An important note about collections (and databases) in MongoDB is that they are created lazily – none of the above commands have actually performed any operations on the MongoDB server. Collections and databases are created when the first document is inserted into them.
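
If you want to see this lazy creation for yourself, one option is to ask the server which collections exist before and after the first insert. The sketch below uses PyMongo’s list_collection_names() method on the database object; the helper name collection_exists is our own, and calling it requires a live connection because it queries the server.

```python
def collection_exists(db, name):
    """Return True if a collection called `name` already exists in `db`.

    `db` is expected to be a PyMongo Database object (for example the
    `db` variable created above); the check asks the server for its
    current list of collection names.
    """
    return name in db.list_collection_names()


# Intended usage with the objects from this tutorial:
#   collection_exists(db, 'test_collection')   # before the first insert
#   collection_exists(db, 'test_collection')   # again, after inserting
```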

Now, let’s insert a few documents into the test_database.test_collection collection. But before we do that, let’s quickly import the datetime library as one of the fields in our document is going to be a date.

Python3




from datetime import datetime
  
col.insert_many([
    {'_id': 1, 'measure': 43, 'status': 'active',
        'installed_on': datetime(2022, 1, 8, 3, 43, 12)},
    {'_id': 2, 'measure': 32, 'status': 'active',
        'installed_on': datetime(2022, 2, 2, 11, 43, 27)},
    {'_id': 3, 'measure': 62, 'status': 'inactive',
        'installed_on': datetime(2022, 3, 12, 3, 53, 12)},
    {'_id': 4, 'measure': 59, 'status': 'active',
        'installed_on': datetime(2022, 4, 8, 3, 22, 45)}
])


Now, let’s verify that the data has been successfully written to your database by running the following commands:

Python3




import pprint

for doc in col.find({}):
    pprint.pprint(doc)


Output:

{'_id': 1,
 'installed_on': datetime.datetime(2022, 1, 8, 3, 43, 12),
 'measure': 43,
 'status': 'active'}
{'_id': 2,
 'installed_on': datetime.datetime(2022, 2, 2, 11, 43, 27),
 'measure': 32,
 'status': 'active'}
{'_id': 3,
 'installed_on': datetime.datetime(2022, 3, 12, 3, 53, 12),
 'measure': 62,
 'status': 'inactive'}
{'_id': 4,
 'installed_on': datetime.datetime(2022, 4, 8, 3, 22, 45),
 'measure': 59,
 'status': 'active'}

As you can see, we have successfully inserted 4 documents into our collection.

Now, let’s run these two lines of code to patch PyMongo in place. This lets us call PyMongoArrow’s functions directly as methods on PyMongo Collection instances.

Python3




from pymongoarrow.monkey import patch_all
patch_all()


Exporting MongoDB data into other formats

Exporting Data out of MongoDB into Pandas DataFrame

Now that we have inserted data into our MongoDB database, let’s see how we can export it to other data formats. We are going to use PyMongoArrow’s find_pandas_all() function to export a MongoDB result set into a Pandas DataFrame. We will pass this function a query predicate to filter the desired documents. For example, we want MongoDB to return all the documents that have a ‘measure’ field value greater than 40.

Python3




import pandas as pd
df = col.find_pandas_all({'measure': {'$gt': 40}})
print(df)


Printing the content of the ‘df’ DataFrame confirms that the data has been successfully written to it.

   _id  measure    status        installed_on
0    1       43    active 2022-01-08 03:43:12
1    3       62  inactive 2022-03-12 03:53:12
2    4       59    active 2022-04-08 03:22:45

As you can see, we have successfully exported MongoDB data into a Pandas DataFrame.
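
To make the query predicate concrete, here is what {'measure': {'$gt': 40}} selects, re-expressed in plain Python over the same four documents. This is only an illustration of the filter’s meaning for numeric fields and not how MongoDB evaluates queries internally; the helper name matches_gt is our own.

```python
# The same four documents inserted earlier (dates omitted for brevity).
docs = [
    {'_id': 1, 'measure': 43, 'status': 'active'},
    {'_id': 2, 'measure': 32, 'status': 'active'},
    {'_id': 3, 'measure': 62, 'status': 'inactive'},
    {'_id': 4, 'measure': 59, 'status': 'active'},
]


def matches_gt(doc, field, threshold):
    """Mimic {field: {'$gt': threshold}} for numeric fields.

    A document without the field does not match.
    """
    return field in doc and doc[field] > threshold


selected = [doc['_id'] for doc in docs if matches_gt(doc, 'measure', 40)]
print(selected)  # [1, 3, 4]
```

These are the same three _id values that appear in the DataFrame above.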

Exporting Data out of MongoDB into Numpy Array

Here, we are going to use PyMongoArrow’s find_numpy_all() function to export a MongoDB result set into NumPy arrays. In this case, we pass an empty query predicate so that all the documents in the collection are returned.

Did you know that PyMongoArrow also lets you define the schema of the data manually so that the exported data field values are in the desired format? You can define the schema by instantiating pymongoarrow.api.Schema using a mapping of field names to type-specifiers. For example:

Python3




import numpy as np
from pymongoarrow.api import Schema
  
# let's define the schema
schema = Schema({'_id': int, 'measure': float,
                 'status': str, 'installed_on': datetime})
npa = col.find_numpy_all({}, schema=schema)


The result, ‘npa’, is a dictionary of NumPy arrays keyed by field name. Printing it confirms that the data has been successfully written to it.

Output:

{'_id': array([1, 2, 3, 4]),
'measure': array([43., 32., 62., 59.]),
'status': array(['active', 'active', 'inactive', 'active'], dtype='<U8'),
'installed_on': array(['2022-01-08T03:43:12.000', '2022-02-02T11:43:27.000',
       '2022-03-12T03:53:12.000', '2022-04-08T03:22:45.000'],
      dtype='datetime64[ms]')}
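
Because the result is a dictionary with one NumPy array per field, the usual vectorized operations work on it directly. The sketch below rebuilds the same arrays by hand, so it runs without a database, and then filters and aggregates them. The variable names active, mean_active and latest_id are our own.

```python
import numpy as np

# Hand-built copy of the exported data shown above.
npa = {
    '_id': np.array([1, 2, 3, 4]),
    'measure': np.array([43.0, 32.0, 62.0, 59.0]),
    'status': np.array(['active', 'active', 'inactive', 'active']),
    'installed_on': np.array(
        ['2022-01-08T03:43:12', '2022-02-02T11:43:27',
         '2022-03-12T03:53:12', '2022-04-08T03:22:45'],
        dtype='datetime64[ms]',
    ),
}

# Boolean mask of the rows whose status is 'active'.
active = npa['status'] == 'active'

# Average measure over the active rows only.
mean_active = npa['measure'][active].mean()

# _id of the most recently installed device.
latest_id = npa['_id'][npa['installed_on'].argmax()]

print(mean_active, latest_id)
```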

Note: We can also export MongoDB data into Arrow Table format using find_arrow_all() function.

Python3




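# No extra import is needed: pyarrow is installed as a dependency of
# PyMongoArrow, and the result is a pyarrow.Table
arrow_table = col.find_arrow_all({})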


Importing Data from other formats into MongoDB

Importing Data from Pandas DataFrame into MongoDB

Importing data from a Pandas DataFrame into MongoDB is very straightforward using PyMongoArrow’s write() function. The write(collection, tabular) function takes two arguments and returns an instance of result.ArrowWriteResult:

  • collection – the PyMongo Collection instance to which you want to write the data.
  • tabular – the data to write. It could be your Pandas DataFrame, a dictionary of NumPy arrays, or an Arrow Table.

Let’s import the ‘write’ function and invoke it. We will pass it two arguments: the collection where we want to write the data, and the DataFrame that we want to write to MongoDB. We are going to reuse the ‘df’ DataFrame that we created in the previous example.

Python3




from pymongoarrow.api import write
write(db.pandas_data, df)


Let’s quickly verify that the content of ‘df’ DataFrame has been written to MongoDB successfully by running the following commands:

Python3




for doc in db.pandas_data.find({}):
    pprint.pprint(doc)


Output:

{'_id': 1,
 'installed_on': datetime.datetime(2022, 1, 8, 3, 43, 12),
 'measure': 43,
 'status': 'active'}
{'_id': 3,
 'installed_on': datetime.datetime(2022, 3, 12, 3, 53, 12),
 'measure': 62,
 'status': 'inactive'}
{'_id': 4,
 'installed_on': datetime.datetime(2022, 4, 8, 3, 22, 45),
 'measure': 59,
 'status': 'active'}

As you can see, we have successfully written Pandas DataFrame data into MongoDB.
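
One thing to keep in mind: each row of the DataFrame becomes one document, and MongoDB requires _id values to be unique within a collection, so rows with repeated _id values would be rejected as duplicates. Here is a minimal sketch of a pre-write check using only Pandas. The helper name find_duplicate_ids is our own, and it only looks for duplicates inside the DataFrame itself, not against documents already stored in the collection.

```python
import pandas as pd


def find_duplicate_ids(frame, id_column='_id'):
    """Return the sorted list of id values that occur more than once."""
    if id_column not in frame.columns:
        return []
    ids = frame[id_column]
    return sorted(ids[ids.duplicated()].unique().tolist())


# A small frame in which _id 3 appears twice.
frame = pd.DataFrame({'_id': [1, 3, 4, 3], 'measure': [43, 62, 59, 62]})
print(find_duplicate_ids(frame))  # [3]
```

An empty list means the DataFrame has no internal _id clashes.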

Importing Data from NumPy Array into MongoDB

We will use the write() function again and pass it two arguments: the collection where we want to write the data, and the NumPy arrays that we want to write to MongoDB. We are going to reuse the ‘npa’ dictionary of arrays that we created in the previous example.

Python3




write(db.numpy_data, npa)


Output:

{'insertedCount': 4}

Let’s quickly verify that the content of ‘npa’ array has been written to MongoDB successfully by running the following commands:

Python3




for doc in db.numpy_data.find({}):
    pprint.pprint(doc)


Output:

{'_id': 1,
 'installed_on': datetime.datetime(2022, 1, 8, 3, 43, 12),
 'measure': 43.0,
 'status': 'active'}
{'_id': 2,
 'installed_on': datetime.datetime(2022, 2, 2, 11, 43, 27),
 'measure': 32.0,
 'status': 'active'}
{'_id': 3,
 'installed_on': datetime.datetime(2022, 3, 12, 3, 53, 12),
 'measure': 62.0,
 'status': 'inactive'}
{'_id': 4,
 'installed_on': datetime.datetime(2022, 4, 8, 3, 22, 45),
 'measure': 59.0,
 'status': 'active'}

As you can see, we have successfully written the NumPy array data into our MongoDB database.

Dominic Rubhabha-Wardslaus