[Libreoffice] One Git v2.0

Norbert Thiebaud nthiebaud at gmail.com
Thu Jun 30 13:37:24 PDT 2011


0. Introduction
===========

The first incarnation of onegit.sh, despite all the tuning effort, was
still taking 36 hours to run. That was within the 'we should be able
to do the conversion over a week-end' criterion, but it was still
painfully long.

So, on to plan B: I tried to use git fast-export/fast-import instead
of git filter-branch. That plan proved successful, and the conversion
itself now takes about 30 _minutes_ (add another 15 for a final git gc
of the resulting core.git and a couple of hours to upload it all).

The core of plan B is lo_git_rewrite, a small C program that massages
the data stream between git fast-export and git fast-import. It is
available in the dev-tools git repo.
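
To make the plumbing concrete, here is a rough Python sketch of how three
processes can be chained so that the rewriter sits in the middle of the
stream. It is not taken from onegit.sh: the function names are made up,
and the use of `--all` with git fast-export is an assumption (the post
does not say which export options the script uses).

```python
import subprocess


def run_pipeline(stages):
    """Run (argv, cwd) stages as one pipeline.

    stdout of each stage feeds stdin of the next one.
    Returns (stdout of the last stage, list of exit codes).
    """
    procs = []
    upstream = None
    for argv, cwd in stages:
        proc = subprocess.Popen(argv, cwd=cwd, stdin=upstream,
                                stdout=subprocess.PIPE)
        if upstream is not None:
            upstream.close()  # the child keeps its own copy of the pipe
        upstream = proc.stdout
        procs.append(proc)
    output = procs[-1].communicate()[0]
    codes = [proc.wait() for proc in procs]
    return output, codes


def conversion_stages(source_repo, target_repo, rewrite_bin, prefix):
    """Build the export | rewrite | import stages for one repo.

    '--all' is an assumption; '--prefix' is the option described in 2.1.1.
    """
    return [
        (["git", "fast-export", "--all"], source_repo),
        ([rewrite_bin, "--prefix", prefix], None),
        (["git", "fast-import"], target_repo),
    ]
```

Something like run_pipeline(conversion_stages("/lo/libo", "/lo/core",
"lo_git_rewrite", "libo")) would then drive one conversion, assuming the
target repository has already been initialized.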

1. Usage Notes
============

If you want to try it for yourself, here are a few things to know:

1.1 Pre-requisites

1.1.1 Platform
This has only been tested on Linux. Other platforms may work, but use
at your own risk.

1.1.2 Git
This has been tested with git 1.7.3.4. Any recent git should work, but
lo_git_rewrite makes a lot of implicit assumptions about the data
stream provided by git fast-export, so any version of git that alters
that stream, even in a way compatible with the git fast-import
specification, may cause trouble.

1.1.3 source git repos
You need to have a 'source' bootstrap tree, including
clone/translation. Make sure that master is checked out and that you
are up to date and clean.

1.1.4 dev-tools
You need to clone the contrib/dev-tools repo and run make in
dev-tools/lo_git_rewrite

1.1.5 temp space
Most of the work is done in a temporary directory. You need 5+GB of
space there (I don't know the exact amount for sure, but 5GB should be
enough).

1.1.6 target repos
The onegit.sh script will create a target repository, with clone/*
populated with the remaining separate git repos (help, translations,
dictionaries and binfilter).
The core repo is initially not properly compacted, and since you
probably want to build it, you need enough space. As a rule of
thumb, count the same amount of space you would normally reserve for a
regular bootstrap build.

1.2 Running

Assuming that your source bootstrap repo is at /lo/libo, that the
target will be /lo/core, that the dev-tools repo is at /lo/dev-tools
and that your temporary workspace is /fast, then run:
cd /lo
time ./dev-tools/onegit/onegit.sh -f -g /lo/libo -n core \
  -t /fast/gittemp 2>&1 | tee onegit.log

While it is running you can look at /lo/onegit.msgs, which contains a
high-level log of what is going on.
Note that in onegit.msgs lines should start with "==="; any line that
starts with "***" indicates that something went wrong.
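
Since the log uses fixed prefixes, it is easy to check mechanically. A
minimal Python sketch, relying only on the "***" convention described
above (the function name is made up for illustration):

```python
def find_problems(lines):
    """Return (line_number, text) for every log line flagged with '***'."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if line.startswith("***"):
            problems.append((number, line.rstrip("\n")))
    return problems
```

For example, find_problems(open("/lo/onegit.msgs")) would list the
flagged lines once the run is over.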

Note: onegit.sh has been tuned to work optimally on an Intel Xeon
X3360 @ 2.83GHz (quad-core), with 8GB of memory and pretty good
disks.
For optimal results on a different machine you may need to tweak the
number of batches that run in parallel and their composition (see
section 2.2.2 for gotchas).

1.3 Known issue
The onegit.sh script, as a final step, tries to apply a set of patches
to fix issues related to the migration. Unfortunately, since master is
a moving target, these patches may fail to apply.
At this stage the conversion is done and core is usable. You can try
to fix the patches that failed to apply (and apply the rest of the
patches).
The patches are in dev-tools/onegit/patches/*

1.4 Testing
Once all the patches are applied, you can start using the 'core' repo
as if it were bootstrap.


2. Reviewer Notes
==============

Reviews are of course welcome.
To help with the review, here are a few pointers.

2.1 Review of lo_git_rewrite

lo_git_rewrite is a fairly small C program that sits between git
fast-export and git fast-import.
Its goal is to fix the trailing-space and tab issues and, optionally,
to exclude or filter out a specific module and/or filter out files
with a specific extension.

2.1.1 arguments

lo_git_rewrite understands the following command-line arguments (note
that the syntax is --foo bar and NOT --foo=bar).
All these arguments are optional.

--prefix "string"
This is used to prefix messages written to stderr with the specified
string. It is used in onegit.sh because more than one instance of
lo_git_rewrite is running in parallel, and this makes it possible to
link a message to a specific lo_git_rewrite instance.
The default is an empty string.

--exclude-module "module_name"
This tells lo_git_rewrite to filter out any file whose name starts with
module_name/. This is used in onegit.sh to filter out a module from a
given repo, like binfilter or dictionaries.

--filter-module "module_name"
This tells lo_git_rewrite to filter out any file whose name does _not_
start with module_name/. This is used in onegit.sh to extract a given
module from an existing repo, like binfilter or dictionaries.

--exclude-suffix "string"
This tells lo_git_rewrite to exclude any file whose name ends with
"string". This is used in onegit.sh to eliminate obsolete .tar.gz
files from the libs-extern-sys and libs-extern history.

--buffer-size nnn
This tells lo_git_rewrite to allocate a working buffer of nnn MB. nnn
must be a number between 10 and 1024. By default lo_git_rewrite
allocates a 30MB buffer and an additional 45MB (nnn * 1.5) temp buffer
to do file content conversion.
The reason to use this is that the buffer needs to be big enough to
contain the biggest blob that can be encountered in the stream. This
is used by onegit.sh for 2 of the repos that have particularly large
blobs (libs-extern-sys and extension).
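
For reviewers who prefer to read rules as code, the option semantics
above can be summed up in a short Python sketch. This is an illustration
only, not a transcription of the C source: the function and dictionary
key names are made up, the second module option is assumed to be spelled
--filter-module, and 1 MB is assumed to mean 1024 * 1024 bytes.

```python
OPTION_KEYS = {
    "--prefix": "prefix",
    "--exclude-module": "exclude_module",
    "--filter-module": "filter_module",   # assumed spelling
    "--exclude-suffix": "exclude_suffix",
    "--buffer-size": "buffer_mb",
}


def parse_args(argv):
    """Parse '--name value' pairs; '--name=value' is rejected."""
    opts = {"prefix": "", "exclude_module": None, "filter_module": None,
            "exclude_suffix": None, "buffer_mb": 30}
    i = 0
    while i < len(argv):
        key = OPTION_KEYS.get(argv[i])
        if key is None or i + 1 >= len(argv):
            raise ValueError(f"bad or incomplete option: {argv[i]}")
        value = argv[i + 1]
        if key == "buffer_mb":
            value = int(value)
            if not 10 <= value <= 1024:
                raise ValueError("--buffer-size must be between 10 and 1024")
        opts[key] = value
        i += 2
    return opts


def buffer_bytes(buffer_mb):
    """Return (working, temp) sizes in bytes; temp is 1.5x the working one."""
    working = buffer_mb * 1024 * 1024  # assumption: binary megabytes
    return working, working * 3 // 2


def keep_path(path, opts):
    """Return True if a file path survives the filtering options."""
    if opts["exclude_module"] and path.startswith(opts["exclude_module"] + "/"):
        return False
    if opts["filter_module"] and not path.startswith(opts["filter_module"] + "/"):
        return False
    if opts["exclude_suffix"] and path.endswith(opts["exclude_suffix"]):
        return False
    return True
```

With the defaults, buffer_bytes(30) gives the 30MB working buffer and the
45MB temp buffer mentioned above.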

2.1.2 Operation

git fast-export creates a stream of 'objects'.
Objects are identified and referenced using an id of the form :<number>
For our purposes there are 3 types of objects: blob, commit and tag.

Blobs come in the stream before they are used, and at that point there
is no indication of what filename(s) will be associated with them.
So we need to 'clean' every blob and re-inject two copies of the blob
(the original and the cleaned one) in the stream so that later (when
we have a filename) we can decide which copy we need to use.
The problem of course is that we need to assign a unique id to the new
blobs we create on the fly.
The technique lo_git_rewrite uses is to intercept every id of the form
:<number> and transform it into :<number>0, except for the extra blob
we create on the fly, which gets assigned :<number>1, where :<number>
is the id of the original version of the blob.
Note: we could have used :2*<number> and :2*<number>+1, but that would
have required converting text to integer and vice versa for each
occurrence of such an id, and for libs-core, for instance, git
fast-import reports more than 1 billion such ids in the stream (yes,
billion as in 10^9; 1,073,741,824 to be precise :-) ).
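
The id trick is purely textual, which the following Python sketch tries
to show. The function names are made up, and only the 'mark', 'from' and
'merge' command lines of the fast-import format are handled here; the
real program deals with the whole stream.

```python
def rewrite_mark(mark, extra_copy=False):
    """Map ':<number>' to ':<number>0', or ':<number>1' for the extra blob.

    No text-to-integer conversion happens: one character is appended.
    """
    if not (mark.startswith(":") and mark[1:].isdigit()):
        raise ValueError(f"not a mark reference: {mark!r}")
    return mark + ("1" if extra_copy else "0")


def rewrite_mark_line(line):
    """Rewrite the id on a 'mark', 'from' or 'merge' command line."""
    command, _, ref = line.partition(" ")
    if command in ("mark", "from", "merge") and ref.startswith(":"):
        return f"{command} {rewrite_mark(ref)}"
    return line  # anything else (e.g. 'from <refname>') passes through
```

Appending a digit amounts to using 10*<number> and 10*<number>+1, so the
new ids stay unique without ever parsing the number.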

Commits are where all the meat is. lo_git_rewrite analyzes <filemodify>
and <filedelete> entries. Depending on the filename of the entry and
the filtering rules, each entry is either modified to use :<number>0
or :<number>1, depending on whether or not we want the 'sanitized'
version of the blob, or simply removed if that filename needs to be
filtered out.
There is no attempt to eliminate commits left 'empty' as a result of
every filemodify/filedelete entry being removed.
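
A Python sketch of that per-entry decision, under a few loudly stated
assumptions: the suffix 1 is taken to denote the sanitized copy (the
text above does not say which of the two it is), the two predicates are
hypothetical callbacks supplied by the caller, and quoted paths are not
handled. In the fast-import format a filemodify line reads
'M <mode> <dataref> <path>' and a filedelete line reads 'D <path>'.

```python
def rewrite_file_entry(line, keep_path, wants_sanitized):
    """Rewrite one filemodify/filedelete line; return None to drop it.

    keep_path(path) and wants_sanitized(path) are caller-supplied
    predicates (hypothetical names, not taken from the C source).
    """
    if line.startswith("M "):
        _, mode, dataref, path = line.split(" ", 3)
        if not keep_path(path):
            return None
        if dataref.startswith(":"):
            # assumption: ':<number>1' is the sanitized copy
            dataref += "1" if wants_sanitized(path) else "0"
        return f"M {mode} {dataref} {path}"
    if line.startswith("D "):
        return line if keep_path(line[2:]) else None
    return line
```

For instance, with keep_path rejecting binfilter/ and wants_sanitized
accepting only .cxx files (an invented rule), 'M 100644 :12 sw/a.cxx'
becomes 'M 100644 :121 sw/a.cxx' and any binfilter/ entry is dropped.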

Tags are only modified to substitute ids with :<number>0.

The code that 'sanitizes' blobs is essentially the same as the one
that existed in clean_spaces. The main difference is that the
'copy-on-write' optimisation that exists in clean_spaces has been
removed since we always want an actual copy.
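
For readers who do not know clean_spaces, here is a rough idea of what
such a clean-up can look like, in Python. The exact rules are an
assumption on my part for the sake of illustration (tabs expanded to
4-column stops, trailing spaces stripped); check the C source for what
is really done.

```python
def sanitize_blob(data, tab_width=4):
    """Return a cleaned version of a blob given as bytes.

    Assumed rules: expand tabs to tab_width-column stops, then strip
    trailing spaces from every line. The output is always rebuilt line
    by line, with no 'unchanged input' shortcut.
    """
    cleaned = []
    for line in data.split(b"\n"):
        line = line.expandtabs(tab_width)
        cleaned.append(line.rstrip(b" "))
    return b"\n".join(cleaned)
```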

lo_git_rewrite trusts git fast-export to produce a sane, predictable
stream. There is very little code/CPU spent checking that the input
stream is sane. The goal was speed and simplicity, not robustness
against misuse.

2.2 Review of onegit.sh

2.2.1 Arguments

onegit.sh --help displays a short summary of the supported arguments
and their default values.

2.2.2 Operation

The script is organized in 3 sections.
First, we check the arguments and verify that the environment is sane
and that we have all we need.
Then we run 4 parallel batches that are balanced so that they
should finish at about the same time. One implicit requirement is that
the 'processing' of bootstrap needs to finish before that of any other
repo; that is why the first task of each other batch is a 'big' repo
that is guaranteed to take significantly longer than bootstrap to
finish.
Finally, a tag is applied on the target repos and patches are applied
to make the resulting repos 'buildable'.
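
The batch structure can be pictured with the Python sketch below. It is
not how onegit.sh is written (that is a shell script); the function
names are made up and process_repo stands for whatever converts one
repo. Note that, as in the script, nothing here enforces the 'bootstrap
finishes first' requirement: it only holds through the choice and order
of the repos in each batch.

```python
from concurrent.futures import ThreadPoolExecutor


def run_batches(batches, process_repo):
    """Run each batch in its own worker; repos inside a batch run in order.

    Returns one list of results per batch, in the order of 'batches'.
    """
    def run_one(batch):
        return [process_repo(repo) for repo in batch]

    with ThreadPoolExecutor(max_workers=len(batches)) as pool:
        return list(pool.map(run_one, batches))
```

Threads are enough for a sketch like this, on the assumption that each
process_repo call mostly waits on external git processes.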


Norbert

