sparrow-ipc 1.0.1
sparrow_ipc::stream_file_serializer Class Reference

A class for serializing Apache Arrow record batches to the IPC file format. More...

#include <stream_file_serializer.hpp>


Public Member Functions

template<writable_stream TStream>
 stream_file_serializer (TStream &stream, std::optional< CompressionType > compression=std::nullopt)
 Constructs a stream_file_serializer object with a reference to a stream.
 
template<writable_stream TStream>
 stream_file_serializer (TStream &stream, const sparrow::record_batch &schema_batch, std::optional< CompressionType > compression=std::nullopt)
 Constructs a stream_file_serializer object with a reference to a stream and a schema.
 
 ~stream_file_serializer ()
 Destructor for the stream_file_serializer.
 
void write (const sparrow::record_batch &rb)
 Writes a single record batch to the file.
 
template<std::ranges::input_range R>
requires std::same_as<std::ranges::range_value_t<R>, sparrow::record_batch>
void write (const R &record_batches)
 Writes a collection of record batches to the file.
 
stream_file_serializer & operator<< (const sparrow::record_batch &rb)
 
template<std::ranges::input_range R>
requires std::same_as<std::ranges::range_value_t<R>, sparrow::record_batch>
stream_file_serializer & operator<< (const R &record_batches)
 
stream_file_serializer & operator<< (stream_file_serializer &(*manip)(stream_file_serializer &))
 
void end ()
 Finalizes the file serialization by writing footer and trailing magic bytes.
 

Public Attributes

bool m_header_written {false}
 
bool m_schema_received {false}
 
std::optional< sparrow::record_batch > m_first_record_batch
 
std::vector< sparrow::data_type > m_dtypes
 
any_output_stream m_stream
 
bool m_ended {false}
 
std::optional< CompressionType > m_compression
 
dictionary_tracker m_dict_tracker
 
std::vector< record_batch_block > m_dictionary_blocks
 
std::vector< record_batch_block > m_record_batch_blocks
 

Detailed Description

A class for serializing Apache Arrow record batches to the IPC file format.

The stream_file_serializer class provides functionality to serialize single or multiple record batches into the Arrow IPC file format suitable for storage. It ensures schema consistency across multiple record batches and optimizes memory allocation by pre-calculating required buffer sizes.

The stream_file_serializer follows the Arrow IPC file format specification:

  • File header magic bytes (ARROW1 + padding)
  • Stream format data (schema + record batches + end-of-stream marker)
  • Footer (FlatBuffer containing schema and empty record batch blocks)
  • Footer size (int32)
  • Trailing magic bytes (ARROW1)
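
The layout listed above can be sketched as a small byte-assembly routine. This is only an illustration of the container structure, not sparrow-ipc code: `assemble_ipc_file` is a hypothetical helper, and the stream and footer payloads are treated as opaque, already-encoded byte vectors. It assumes the header magic is zero-padded to 8 bytes and the footer size is stored as a little-endian 32-bit integer, as in the Arrow IPC file format.

```cpp
#include <cstdint>
#include <string_view>
#include <vector>

// Hypothetical helper (not part of sparrow-ipc): wraps already-encoded
// stream and footer payloads in the Arrow IPC file container layout.
inline std::vector<std::uint8_t> assemble_ipc_file(
    const std::vector<std::uint8_t>& stream_bytes,
    const std::vector<std::uint8_t>& footer_bytes)
{
    constexpr std::string_view magic = "ARROW1";

    std::vector<std::uint8_t> out;
    out.reserve(8 + stream_bytes.size() + footer_bytes.size() + 4 + magic.size());

    // 1. Header magic, zero-padded to 8 bytes.
    out.insert(out.end(), magic.begin(), magic.end());
    out.insert(out.end(), 2, std::uint8_t{0});

    // 2. Stream-format data (schema, record batches, end-of-stream marker).
    out.insert(out.end(), stream_bytes.begin(), stream_bytes.end());

    // 3. Footer payload (a FlatBuffer in a real file; opaque bytes here).
    out.insert(out.end(), footer_bytes.begin(), footer_bytes.end());

    // 4. Footer size as a little-endian 32-bit integer.
    const auto footer_size = static_cast<std::uint32_t>(footer_bytes.size());
    for (int shift = 0; shift < 32; shift += 8)
    {
        out.push_back(static_cast<std::uint8_t>((footer_size >> shift) & 0xFF));
    }

    // 5. Trailing magic, without padding.
    out.insert(out.end(), magic.begin(), magic.end());
    return out;
}
```

Keeping the footer size and magic at the very end is what lets a reader locate the footer by seeking from the end of the file.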

The class validates that all record batches have consistent schemas and throws std::invalid_argument if inconsistencies are detected.

Note
Unlike the stream serializer, the file serializer automatically writes the complete file format (including header and footer) when end() is called or when the destructor is invoked.

Definition at line 70 of file stream_file_serializer.hpp.

Constructor & Destructor Documentation

◆ stream_file_serializer() [1/2]

template<writable_stream TStream>
sparrow_ipc::stream_file_serializer::stream_file_serializer ( TStream & stream,
std::optional< CompressionType > compression = std::nullopt )
inline

Constructs a stream_file_serializer object with a reference to a stream.

Template Parameters
TStream: The type of the stream to be used for serialization.
Parameters
stream: Reference to the stream object that will be used for serialization operations. The serializer stores a pointer to this stream for later use.
compression: Optional compression type to apply to record batch bodies.

Definition at line 83 of file stream_file_serializer.hpp.

◆ stream_file_serializer() [2/2]

template<writable_stream TStream>
sparrow_ipc::stream_file_serializer::stream_file_serializer ( TStream & stream,
const sparrow::record_batch & schema_batch,
std::optional< CompressionType > compression = std::nullopt )
inline

Constructs a stream_file_serializer object with a reference to a stream and a schema.

This constructor allows establishing the schema for the file immediately, which is useful when the number of record batches is zero or when the schema is known upfront.

Template Parameters
TStream: The type of the stream to be used for serialization.
Parameters
stream: Reference to the stream object that will be used for serialization operations.
schema_batch: A record batch containing the schema for the file. The data in this batch is NOT written to the file; only its schema is used.
compression: Optional compression type to apply to record batch bodies.

Definition at line 102 of file stream_file_serializer.hpp.

◆ ~stream_file_serializer()

sparrow_ipc::stream_file_serializer::~stream_file_serializer ( )

Destructor for the stream_file_serializer.

Ensures proper cleanup by calling end() if the serializer has not been explicitly ended. This guarantees that the complete file format (including footer and trailing magic bytes) is written before the object is destroyed.
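
The finalize-on-destruction behavior described above can be illustrated with a minimal sketch. `mock_file_writer` is a hypothetical stand-in that records events in a log instead of writing bytes; it is not the library's implementation, only the general RAII pattern of a destructor that calls an idempotent `end()`.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a file serializer: logs events instead of bytes.
class mock_file_writer
{
public:
    explicit mock_file_writer(std::vector<std::string>& log)
        : m_log(&log)
    {
    }

    mock_file_writer(const mock_file_writer&) = delete;
    mock_file_writer& operator=(const mock_file_writer&) = delete;

    ~mock_file_writer()
    {
        // Finalize on destruction unless end() has already run.
        if (!m_ended)
        {
            end();
        }
    }

    void write(const std::string& batch)
    {
        m_log->push_back("batch:" + batch);
    }

    void end()
    {
        if (m_ended)
        {
            return;  // Already finalized: nothing more to write.
        }
        m_log->push_back("footer");
        m_ended = true;
    }

private:
    std::vector<std::string>* m_log;
    bool m_ended{false};
};
```

Destructors are implicitly `noexcept` in C++, so an exception escaping one terminates the program unless the implementation catches it. Calling `end()` explicitly before the object goes out of scope keeps any error handling in the caller's hands.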

Member Function Documentation

◆ end()

void sparrow_ipc::stream_file_serializer::end ( )

Finalizes the file serialization by writing footer and trailing magic bytes.

This method completes the Arrow IPC file format by:

  1. Writing the end-of-stream marker
  2. Writing the footer (FlatBuffer containing schema)
  3. Writing the footer size (int32)
  4. Writing the trailing magic bytes (ARROW1)

It can be called multiple times safely as it tracks whether the file has already been ended to prevent duplicate operations.

Note
This method is idempotent - calling it multiple times has no additional effect.
Postcondition
After calling this method, m_ended will be set to true.
Exceptions
std::runtime_error if no record batches have been written
Examples
/home/runner/work/sparrow-ipc/sparrow-ipc/include/sparrow_ipc/stream_file_serializer.hpp.
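
The trailer written by end() (footer size followed by the trailing magic) can be checked from the reader side. The sketch below is a hypothetical helper, not a sparrow-ipc function: `read_footer_size` inspects the last ten bytes of a buffer, assuming a little-endian 32-bit footer size immediately before the 6-byte `ARROW1` magic.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>
#include <vector>

// Hypothetical helper (not part of sparrow-ipc): returns the footer size
// stored in the trailer, or std::nullopt if the trailer is missing or malformed.
inline std::optional<std::int32_t> read_footer_size(const std::vector<std::uint8_t>& file)
{
    constexpr std::string_view magic = "ARROW1";
    constexpr std::size_t tail_size = 4 + magic.size();
    if (file.size() < tail_size)
    {
        return std::nullopt;
    }

    // The file must end with the trailing magic bytes.
    const std::size_t magic_pos = file.size() - magic.size();
    for (std::size_t i = 0; i < magic.size(); ++i)
    {
        if (file[magic_pos + i] != static_cast<std::uint8_t>(magic[i]))
        {
            return std::nullopt;
        }
    }

    // The four bytes before the magic hold the footer size (little-endian).
    const std::size_t size_pos = magic_pos - 4;
    std::uint32_t raw = 0;
    for (std::size_t i = 0; i < 4; ++i)
    {
        raw |= static_cast<std::uint32_t>(file[size_pos + i]) << (8 * i);
    }
    return static_cast<std::int32_t>(raw);
}
```

A file whose serializer was never ended would lack this trailer, so a check like this returns `std::nullopt` for it.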

◆ operator<<() [1/3]

template<std::ranges::input_range R>
requires std::same_as<std::ranges::range_value_t<R>, sparrow::record_batch>
stream_file_serializer & sparrow_ipc::stream_file_serializer::operator<< ( const R & record_batches)
inline

Definition at line 292 of file stream_file_serializer.hpp.

◆ operator<<() [2/3]

stream_file_serializer & sparrow_ipc::stream_file_serializer::operator<< ( const sparrow::record_batch & rb)
inline

Definition at line 266 of file stream_file_serializer.hpp.

◆ operator<<() [3/3]

stream_file_serializer & sparrow_ipc::stream_file_serializer::operator<< ( stream_file_serializer &(* manip )(stream_file_serializer &))
inline

Definition at line 312 of file stream_file_serializer.hpp.
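
This overload is undocumented here; its signature matches the familiar stream-manipulator idiom, in which a free function taking and returning the serializer by reference is applied through `operator<<`. The sketch below shows that idiom on a hypothetical `mock_serializer`. The `end_file` manipulator name is an assumption made for illustration and may not match what the library provides.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a serializer: records events instead of bytes.
class mock_serializer
{
public:
    // Appends one "batch" and returns *this so calls can be chained.
    mock_serializer& operator<<(const std::string& batch)
    {
        m_events.push_back(batch);
        return *this;
    }

    // Manipulator overload: applies a free function to the serializer.
    mock_serializer& operator<<(mock_serializer& (*manip)(mock_serializer&))
    {
        return manip(*this);
    }

    void end()
    {
        if (!m_ended)
        {
            m_events.push_back("<end>");
            m_ended = true;
        }
    }

    const std::vector<std::string>& events() const
    {
        return m_events;
    }

private:
    std::vector<std::string> m_events;
    bool m_ended{false};
};

// Hypothetical manipulator (name assumed): finalizes the serializer in a chain.
inline mock_serializer& end_file(mock_serializer& s)
{
    s.end();
    return s;
}
```

With these definitions, `s << std::string("a") << std::string("b") << end_file;` appends two entries and then finalizes, all in one expression.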

◆ write() [1/2]

template<std::ranges::input_range R>
requires std::same_as<std::ranges::range_value_t<R>, sparrow::record_batch>
void sparrow_ipc::stream_file_serializer::write ( const R & record_batches)
inline

Writes a collection of record batches to the file.

This method efficiently adds multiple record batches to the serialization stream by first calculating the total required size and reserving memory space to minimize reallocations during the append operations.

Template Parameters
R: The type of the record batch collection (must be iterable)
Parameters
record_batches: A collection of record batches to append to the file
Exceptions
std::runtime_error if the serializer has been ended
std::invalid_argument if any record batch schema doesn't match

The method performs the following operations:

  1. Writes file header magic bytes (if first write)
  2. Calculates the total size needed for all record batches
  3. Reserves the required memory space in the stream
  4. Writes schema message (if first write)
  5. Iterates through each record batch and writes it to the stream

Definition at line 161 of file stream_file_serializer.hpp.

◆ write() [2/2]

void sparrow_ipc::stream_file_serializer::write ( const sparrow::record_batch & rb)

Writes a single record batch to the file.

Parameters
rb: The record batch to write to the file
Exceptions
std::runtime_error if the serializer has been ended
std::invalid_argument if the record batch schema doesn't match the established schema

Member Data Documentation

◆ m_compression

std::optional<CompressionType> sparrow_ipc::stream_file_serializer::m_compression

Definition at line 341 of file stream_file_serializer.hpp.

◆ m_dict_tracker

dictionary_tracker sparrow_ipc::stream_file_serializer::m_dict_tracker

Definition at line 342 of file stream_file_serializer.hpp.

◆ m_dictionary_blocks

std::vector<record_batch_block> sparrow_ipc::stream_file_serializer::m_dictionary_blocks

Definition at line 343 of file stream_file_serializer.hpp.

◆ m_dtypes

std::vector<sparrow::data_type> sparrow_ipc::stream_file_serializer::m_dtypes

Definition at line 338 of file stream_file_serializer.hpp.

◆ m_ended

bool sparrow_ipc::stream_file_serializer::m_ended {false}

Definition at line 340 of file stream_file_serializer.hpp.

◆ m_first_record_batch

std::optional<sparrow::record_batch> sparrow_ipc::stream_file_serializer::m_first_record_batch

Definition at line 337 of file stream_file_serializer.hpp.

◆ m_header_written

bool sparrow_ipc::stream_file_serializer::m_header_written {false}

Definition at line 335 of file stream_file_serializer.hpp.

◆ m_record_batch_blocks

std::vector<record_batch_block> sparrow_ipc::stream_file_serializer::m_record_batch_blocks

Definition at line 344 of file stream_file_serializer.hpp.

◆ m_schema_received

bool sparrow_ipc::stream_file_serializer::m_schema_received {false}

Definition at line 336 of file stream_file_serializer.hpp.

◆ m_stream

any_output_stream sparrow_ipc::stream_file_serializer::m_stream

Definition at line 339 of file stream_file_serializer.hpp.


The documentation for this class was generated from the following file: