Django Bulk Save

Jispa - 2019

I have had no qualms working with Django over the past two years. However, I ran into some problems when I had to work with bulk creation.

The ask was simple: parse a CSV file and save each row into the database. The template structure and values were not in my sphere of influence, so I'd have to read and transform certain attributes/cells of each row. The CSV would have at least 5000 rows.
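The read-and-transform step can be sketched with nothing but the stdlib csv module. The column names and transformations below are made up for illustration; the real template was fixed externally.

```python
import csv
import io

# Hypothetical input mimicking an externally-defined template.
raw = "Flavour Name,Price\n vanilla ,10\npistachio,12\n"

def parse_rows(fileobj):
    """Read each CSV row and transform the cells we care about."""
    rows = []
    for row in csv.DictReader(fileobj):
        rows.append({
            "name": row["Flavour Name"].strip().title(),
            "price": int(row["Price"]),
        })
    return rows

rows = parse_rows(io.StringIO(raw))
# rows[0] == {"name": "Vanilla", "price": 10}
```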

Approach #1

I parsed the CSV into a list of dictionaries and passed it to a ModelSerializer, which would validate and transform certain attributes. I overrode the serializer's create method and used a list serializer to do the bulk_create operation.

from rest_framework import serializers

class ListIceCreamSerializer(serializers.ListSerializer):
    def create(self, validated_data):
        # Build unsaved model instances, then persist them in a single query.
        result = [self.child.create(attrs) for attrs in validated_data]
        IceCream.objects.bulk_create(result, ignore_conflicts=True)

        return result

class IceCreamSerializer(serializers.ModelSerializer):
    def create(self, validated_data):
        # Return an unsaved instance; the list serializer does the saving.
        return IceCream(**validated_data)

    class Meta:
        model = IceCream
        fields = "__all__"
        list_serializer_class = ListIceCreamSerializer

Saving 5000 rows took around 15 seconds. Since this was supposed to be a synchronous request, that was way too high.

Approach #2

I decided to use psycopg2's copy_from function, which uses Postgres's COPY command to load a file directly into a table. Since I wasn't using a serializer, the validations and transformations were done while converting each row to a dictionary. The list of dictionaries would then be written to an in-memory CSV, which would be copied into the database.

@action(methods=["POST"], detail=False)
def bulk_upload(self, request, *args, **kwargs):
    uploaded_file = request.FILES["file"]
    file_stream = io.StringIO(uploaded_file.read().decode("utf-8"))
    csv_data = pandas.read_csv(file_stream, delimiter=',').to_dict('records')

    # Write the transformed rows to an in-memory, tab-delimited CSV.
    stream = io.StringIO()
    writer = csv.writer(stream, delimiter='\t')

    for row in csv_data:
        writer.writerow([str(uuid.uuid4()), row["name"]])

    # Rewind the stream, then COPY it straight into the table.
    stream.seek(0)
    with closing(connection.cursor()) as cursor:
        cursor.copy_from(
            stream,
            IceCream._meta.db_table,
            columns=('id', 'name'),
        )
    return Response(data=csv_data, status=status.HTTP_200_OK)

The copy_from function scales really well: I could write 20000 rows in a second. One caveat is that you don't get a response with what was saved in that particular session. Also, unlike the ignore_conflicts flag you get with bulk_create, copy_from will abort and raise an error if even one row fails to save. Finally, creating temporary files in memory is something you probably wouldn't do if a large number of concurrent users perform the bulk upload workflow.
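A common workaround for the all-or-nothing behaviour (not used in this post) is to COPY into an unconstrained staging table and then insert from it with ON CONFLICT DO NOTHING. The sketch below illustrates the idea with the stdlib sqlite3 module so it's runnable; INSERT OR IGNORE plays the role that cursor.copy_from plus ON CONFLICT DO NOTHING would play in Postgres. Table and column names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE icecream (id TEXT PRIMARY KEY, name TEXT)")
# The staging table carries no constraints, so the bulk load itself can't fail.
conn.execute("CREATE TABLE icecream_staging (id TEXT, name TEXT)")

rows = [("a", "vanilla"), ("b", "pistachio"), ("a", "duplicate id")]
conn.executemany("INSERT INTO icecream_staging VALUES (?, ?)", rows)

# Conflicting rows are dropped instead of aborting the whole load.
conn.execute(
    "INSERT OR IGNORE INTO icecream SELECT id, name FROM icecream_staging"
)
saved = conn.execute("SELECT id, name FROM icecream ORDER BY id").fetchall()
```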

Here's a link to the repo if you want to benchmark.