S-10

Automatic digitization of time series plots from published sources with EVA application

Veronika Voronova, Leonid Stolbov 1, Kirill Peskov 1,2,3, Kirill Zhudenkov 1,2, Victor Sokolov 1,3

1 M&S Decisions (Dubai, United Arab Emirates), 2 Research Center of Model-Informed Drug Development, Sechenov First Moscow State Medical University (Moscow, Russia), 3 Marchuk Institute of Numerical Mathematics of the Russian Academy of Sciences (INM RAS) (Moscow, Russia)

Objectives:
Extraction of quantitative data from published figures is a necessary but tedious and time-consuming step that follows a systematic literature review, which is the foundation of any meta-analysis or systems modeling effort. The prospect of accelerating research by automating this process drives the refinement of existing software products (e.g. PlotDigitizer and Engauge Digitizer [1]) as well as the development of new tools that use both basic computer vision algorithms and advanced neural networks, such as ChartReader, ChartDete and LineEX [2-4]. However, to date, no software solution has been developed for fully automatic extraction of data from the complex time-series plots and scatterplots typical of pharmacometrics data. The EVA (Extraction, Visualization and Analysis) application within the Simurg® ecosystem aims to fill this gap.
Methods:
EVA functionality is implemented in Python v3.12.7 and is based on parsing vector graphics from PDF files, followed by extraction of the pixel coordinates and properties (e.g. color and transparency) of graphical objects. Chart elements are classified as (1) x or y axes, (2) experimental data points, (3) data error bars, or (4) x- or y-axis labels and ticks using heuristic rules based on element characteristics. In the next step, the coordinates of data points and error bars are converted from pixel space to data values using a RANSAC regression model fitted to the label values and their respective tick positions. Finally, points are matched to error bars based on their characteristics and relative positions.
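The pixel-to-value conversion step can be illustrated with a minimal sketch. It is not EVA's code: the function name `fit_axis`, the `(pixel, label value)` input format, and the pixel tolerance are assumptions made for illustration. Because an axis has few ticks, the sketch tries every pair of ticks instead of sampling pairs at random as classical RANSAC does, and it handles a logarithmic axis by fitting in log10 space.

```python
import math
from itertools import combinations


def fit_axis(ticks: list[tuple[float, float]], log_scale: bool = False,
             tol: float = 1.0):
    """Return a pixel -> value converter fitted to (pixel, label value) ticks.

    RANSAC-style consensus fit: each pair of ticks proposes a line, the
    proposal with the most inliers wins, and the final line is a
    least-squares refit on those inliers only.
    """
    # On a logarithmic axis the pixel position is linear in log10(value).
    pts = [(p, math.log10(v) if log_scale else v) for p, v in ticks]
    best: list[tuple[float, float]] = []
    for (p1, v1), (p2, v2) in combinations(pts, 2):
        if p1 == p2 or v1 == v2:
            continue  # degenerate proposal, cannot define an axis mapping
        slope = (v2 - v1) / (p2 - p1)
        intercept = v1 - slope * p1
        # Residuals are measured in pixels so `tol` is independent of units.
        inliers = [(p, v) for p, v in pts
                   if abs((v - intercept) / slope - p) <= tol]
        if len(inliers) > len(best):
            best = inliers
    if len(best) < 2:
        raise ValueError("need at least two consistent ticks")

    # Ordinary least-squares refit on the consensus set.
    n = len(best)
    mean_p = sum(p for p, _ in best) / n
    mean_v = sum(v for _, v in best) / n
    sxx = sum((p - mean_p) ** 2 for p, _ in best)
    sxy = sum((p - mean_p) * (v - mean_v) for p, v in best)
    slope = sxy / sxx
    intercept = mean_v - slope * mean_p

    def to_value(pixel: float) -> float:
        y = slope * pixel + intercept
        return 10 ** y if log_scale else y

    return to_value


# Hypothetical x-axis ticks; the last label (99) stands for a misread value.
x_ticks = [(100, 0), (200, 10), (300, 20), (400, 30), (250, 99)]
to_time = fit_axis(x_ticks)
print(round(to_time(350), 6))  # 25.0
```

The consensus step is what makes the mapping tolerant of a mislabeled or misclassified tick: the outlier above is excluded from the final fit instead of skewing the slope.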
Results:
EVA enables fully automatic extraction of experimental data from vectorized figures in PDF files, including both data points and error bars of multiple time series of different colors and shapes simultaneously, from plots on linear and logarithmic scales. The computational simplicity of the proposed approach is a key advantage over neural network-based methods, as it eliminates the need for a GPU. Another advantage is higher accuracy compared with computer vision techniques, because element positions are extracted explicitly from the document instead of being predicted by a model.
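As an illustration of how data points and error bars from several series might be paired, the sketch below matches each point to a vertical error bar of the same color that lies at the same x position and spans the point. The `Point` and `ErrorBar` structures, the color-equality rule, and the tolerance are hypothetical; the abstract states only that matching relies on element characteristics and relative positions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    x: float
    y: float
    color: str


@dataclass(frozen=True)
class ErrorBar:
    x: float
    y_low: float
    y_high: float
    color: str


def match_error_bars(points: list[Point], bars: list[ErrorBar],
                     tol: float = 1.5) -> dict[int, int]:
    """Return {point index: bar index}; each bar is used at most once."""
    candidates = []
    for i, pt in enumerate(points):
        for j, bar in enumerate(bars):
            same_color = pt.color == bar.color
            aligned = abs(pt.x - bar.x) <= tol
            spans = bar.y_low - tol <= pt.y <= bar.y_high + tol
            if same_color and aligned and spans:
                candidates.append((abs(pt.x - bar.x), i, j))

    # Greedy assignment: the best-aligned candidate pairs are accepted first.
    matches: dict[int, int] = {}
    used_bars: set[int] = set()
    for _, i, j in sorted(candidates):
        if i not in matches and j not in used_bars:
            matches[i] = j
            used_bars.add(j)
    return matches


# Two series sharing almost the same x position, plus a point without a bar.
points = [Point(10.0, 50.0, "#ff0000"), Point(10.4, 80.0, "#0000ff"),
          Point(30.0, 40.0, "#ff0000")]
bars = [ErrorBar(10.2, 70.0, 90.0, "#0000ff"),
        ErrorBar(10.0, 45.0, 55.0, "#ff0000")]
print(match_error_bars(points, bars))  # {0: 1, 1: 0}
```

Points left out of the returned mapping, such as the third one here, are simply reported without an error bar.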
The program’s core is supplemented with a TypeScript-based user interface that supports import of the source document, navigation, rotation and zooming, figure area selection, running auto-extraction, assessment of digitization quality, manual correction of the result, and export of the final output. For raster images or complex plots with gapped axes or scientific notation, the software can be used as a manual or semi-automated digitization tool, which involves basic digitization steps such as defining the coordinate system and manually digitizing and grouping points and error bars.
Conclusion:
EVA is a unique and robust solution within the Simurg® ecosystem, developed for automatic digitization of time series plots associated with pharmacometrics data. It also serves as a basis for further development of computer vision algorithms that would extend its functionality to any type of data and figures.

References:
1. CPT Pharmacometrics Syst. Pharmacol. (2020) 9, 322–331; doi:10.1002/psp4.12511
2. Shivasankaran, V. P., Hassan, M. Y., Singh, M. (2023). LineEX: Data Extraction from Scientific Line Charts. In: 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). IEEE.
3. Yan, P., Ahmed, S., Doermann, D. (2023). Context-Aware Chart Element Detection. In: Fink, G.A., Jain, R., Kise, K., Zanibbi, R. (eds) Document Analysis and Recognition – ICDAR 2023. ICDAR 2023. Lecture Notes in Computer Science, vol 14187. Springer, Cham. https://doi.org/10.1007/978-3-031-41676-7_13
4. Cheng, Z., Dai, Q., Hauptmann, A. G. (2023). ChartReader: A Unified Framework for Chart Derendering and Comprehension without Heuristic Rules. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, pp. 22145-22156. https://doi.org/10.1109/ICCV51070.2023.02029

Reference: PAGE 34 (2026) Abstr 11872 [www.page-meeting.org/?abstract=11872]

Poster: Software Demonstration